How Dropbox leverages testing to maintain a high level of trust at scale

This is part 2 of the Testing at scale series of articles where we asked industry experts to share their testing strategies. In this article, Ryan Harter, Staff Engineer at Dropbox, shares how the shape of Dropbox’s testing pyramid changed over time, and what tools they use to get timely feedback.
With more than one billion downloads, the Dropbox app for Android has to maintain a high quality bar for a diverse set of use cases and users. With fewer than 30 Android engineers, manual testing and #yolo aren’t enough to maintain confidence in our codebase, so we employ a variety of testing strategies to ensure we can continually serve our users’ needs.
Since Dropbox makes it easy to access your files across all of your devices, the Android app has to support viewing as many of those files as possible, including media files, documents, photos, and all of the variations within these categories. Additionally, features like Camera Uploads, which automatically backs up all of your most important photos, require deep integration with the Android OS in ways that have changed significantly over the years and across Android versions. All of this needs to continually work for our users, without them having to worry about the complexity, because the last thing anyone wants is to worry that they might lose their data.
While the size and distribution of the Android team at Dropbox has changed throughout the years, it’s imperative that we’re able to consistently build and refine features within the app while maintaining the level of trust from our users that we’ve become known for. To help underscore how Dropbox has been able to foster that trust, I’d like to share some ways that our testing strategies have changed over the years.
How it started
While automated testing has always been an important part of engineering culture at Dropbox, it hasn’t always been easy on Android. Years ago Dropbox invested in testing infrastructure that leaned heavily on End-to-End (E2E) testing. Building on Android’s instrumentation tests, we developed test helpers for features in the app following the test robot pattern. This enabled us to create a large suite of tests that could simulate a user moving throughout the app, but it came with significant costs.
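The test robot pattern hides UI framework details behind a small, feature-specific API so tests read like user journeys. Here is a minimal sketch of what such a helper might look like using Espresso; the screen, view IDs, and methods are hypothetical illustrations, not Dropbox’s actual helpers:

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.action.ViewActions.typeText
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withId

// Hypothetical robot for a sign-in screen. Each method returns `this`
// so steps chain together into a readable user journey.
class SignInRobot {
    fun enterEmail(email: String) = apply {
        onView(withId(R.id.email_field)).perform(typeText(email))
    }

    fun tapSignIn() = apply {
        onView(withId(R.id.sign_in_button)).perform(click())
    }

    fun assertFileListVisible() = apply {
        onView(withId(R.id.file_list)).check(matches(isDisplayed()))
    }
}

// A test then reads like the user flow it exercises:
// SignInRobot().enterEmail("user@example.com").tapSignIn().assertFileListVisible()
```

The Espresso calls live in one place per feature, so when a screen changes, only its robot needs updating rather than every test that touches it.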
Like many Android projects at the time, the Dropbox app started out as a monolithic app module, but that wasn’t sustainable in the long run. Work was done to decompose the monolith into a more modular architecture, but the E2E test suite wasn’t prioritized in this effort due to the complex interplay of dependencies. This left our E2E test suite as a monolith of its own, with test code that didn’t live alongside the feature code it exercised, making it easy for tests to be overlooked and become outdated.
Additionally, the long build times that come with monolithic modules full of dependencies, combined with running the tests on emulators in our custom continuous integration (CI) environment, meant that the feedback cycle for these E2E tests was slow. As a result, engineers felt incentivized to remove failing tests rather than update them.
As the Android ecosystem embraced automated testing, with the introduction of helpful libraries like Espresso and Robolectric and support for unit testing built directly into Gradle, Dropbox kept up by moving from a heavy reliance on E2E tests toward more and more unit tests, filling out the bottom layer of the previously inverted testing pyramid. This was a significant win for test coverage within the app, and allowed us to roll out quality assurance practices like code coverage baselines to ensure that we continually improved the reliability of the product.
Over time, as unit testing became easier and easier and engineers became more and more frustrated with the slow feedback cycles of E2E tests, our testing pyramid became lopsided in the other direction. We had confidence in our unit tests and the infrastructure supporting them, but our E2E tests aged without much support, becoming more and more unreliable, to the point that we mostly ignored their failures. Tests that can’t be trusted end up becoming a maintenance burden and provide little value, so we recognized that something needed to change.
How it’s going
Over the past year we’ve doubled down on our focus on reliability. We’ve invested in our test infrastructure to ensure that engineers are not only able, but incentivized, to write valuable tests across all layers of the testing pyramid. In addition to technical investment in code and tooling, that has required taking the time to evaluate what we test and how we test it, and making sure the entire team has a clear understanding of which tools to use when.
Unit testing
We continue to spend most of our efforts writing unit tests. These are fast, focused tests that provide quick feedback, and serve as our first line of defense against regressions. We write JUnit tests whenever we can, and fall back to instrumentation tests when we need to. Robolectric’s interoperability with AndroidX Test has allowed us to move many of our instrumentation tests to JVM-based unit tests, making it even easier to meet our test coverage goals.
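Robolectric’s AndroidX Test integration means the same androidx.test APIs work on the local JVM; a test only needs the AndroidJUnit4 runner, and it can later move to a device without rewriting. A rough sketch, where the class under test is hypothetical:

```kotlin
import android.content.Context
import androidx.test.core.app.ApplicationProvider
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Assert.assertTrue
import org.junit.Test
import org.junit.runner.RunWith

// With AndroidJUnit4, this class runs on the JVM under Robolectric when
// placed in the unit test source set, and on a device when placed in the
// instrumentation test source set, with no code changes.
@RunWith(AndroidJUnit4::class)
class FileNameFormatterTest {

    @Test
    fun formatsDisplayName() {
        val context = ApplicationProvider.getApplicationContext<Context>()
        // FileNameFormatter is a hypothetical class that needs a Context.
        val formatter = FileNameFormatter(context)
        assertTrue(formatter.displayName("report.pdf").isNotEmpty())
    }
}
```

Because the test body is identical in both environments, migrating a slow instrumentation test to a fast JVM test is mostly a matter of which source set the file lives in.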
Speaking of test coverage goals, the unit testing layer is the only layer that we use to determine our code coverage. By default we target 80% test coverage, though we have a process to override this target for circumstances in which unit testing is either not valuable, or infeasible.
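A coverage floor like this can be enforced with JaCoCo’s Gradle verification support. The following is a simplified Gradle Kotlin DSL sketch, not Dropbox’s actual build configuration; wiring JaCoCo into an Android module typically involves additional setup for variants and class directories:

```kotlin
// build.gradle.kts (sketch): fail the build when line coverage
// drops below the 80% target.
tasks.register<JacocoCoverageVerification>("jacocoCoverageVerification") {
    violationRules {
        rule {
            limit {
                counter = "LINE"
                value = "COVEREDRATIO"
                minimum = "0.80".toBigDecimal()
            }
        }
    }
}
```

An override process then amounts to adjusting (or exempting) the rule for modules where the target isn’t valuable or feasible.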
- Note: While we use standard JaCoCo tooling to evaluate our test coverage, its lack of deep understanding of Kotlin presents some challenges. For instance, we haven’t yet found a way to inform JaCoCo that the generated accessors, toString(), and hashCode() of behaviorless data classes don’t require test coverage. We’ve been experimenting with and considering alternatives to ensure that we’re not writing brittle tests that don’t provide value, but for now we are stuck issuing coverage overrides for these cases.
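The methods in question are the ones the Kotlin compiler derives from a data class’s primary constructor. For example:

```kotlin
// The compiler generates equals(), hashCode(), toString(), componentN(),
// and copy() for this one-line declaration. None of it is hand-written,
// but JaCoCo still counts the generated bytecode as uncovered lines.
data class FileEntry(val name: String, val sizeBytes: Long)

fun main() {
    val entry = FileEntry("report.pdf", 1024)
    println(entry)                 // generated toString()
    println(entry == entry.copy()) // generated equals() and copy()
}
```

Writing tests that call these generated members solely to satisfy a coverage target exercises the compiler, not the app, which is exactly the kind of brittle, low-value test the override process exists to avoid.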
E2E testing
Over the past several months we’ve been renewing investment in our automated E2E test suite. This suite can alert us to extremely important issues that unit tests simply can’t identify, like OS integration problems or unexpected API responses. We’ve therefore worked hard to improve our infrastructure so tests are easier for engineers to run locally, audited and removed flaky or invalid tests, and invested in documentation and training to support engineers in creating and maintaining the E2E test suite.
Change in E2E test counts before and after test suite improvement effort.
As I mentioned above, our E2E tests simulate a user moving throughout the app. This means that the task of defining our E2E test cases is more than simply an engineering problem. Therefore, we developed guidance to help engineers work with product and design partners to define test cases that represent true use cases.
We recently introduced a practice of using a proper Definition of Done for development work. This amounts to a checklist of items that must be completed in order for a project to be considered “done”, which is defined and agreed upon at the beginning of the project. Our standard checklist includes the declaration of E2E test cases for the project, which ensures that we are adding test cases in a thoughtful manner, taking into account the value and purpose of those tests, instead of targeting arbitrary coverage numbers.
Screenshot testing
Another dimension of our testing that we’ve ramped up in recent years is screenshot testing. Screenshot tests guard against visual regressions, ensuring that views render properly in light and dark mode, in different orientations, and on different form factors.
In unit tests we leverage Paparazzi for screenshot testing. This allows us to write fast, isolated tests and we find it’s best suited for testing individual view or composable layouts, including our design system components.
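A Paparazzi test runs on the JVM and renders a view or composable to an image without a device. A minimal sketch, where the composable under test is a hypothetical design-system component:

```kotlin
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import org.junit.Rule
import org.junit.Test

class BadgeScreenshotTest {
    @get:Rule
    val paparazzi = Paparazzi(deviceConfig = DeviceConfig.PIXEL_5)

    @Test
    fun badge_default() {
        // Badge is a hypothetical design-system composable; Paparazzi
        // renders it on the JVM and compares against a golden image.
        paparazzi.snapshot {
            Badge(text = "New")
        }
    }
}
```

Golden images are recorded with the `recordPaparazzi` Gradle tasks and checked with the `verifyPaparazzi` tasks, so the whole loop runs in CI without an emulator.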
We also find value in executing screenshot tests in more fully featured instrumentation tests. For this, we use our own Dropshots library, which supports screenshot testing on devices and emulators. Since Dropshots executes screenshot tests on real (or emulated) devices, it is a great way to validate system integrations like edge-to-edge display, the default window mode on Android 15 devices.
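A Dropshots test looks similar, but runs as an instrumentation test so rendering goes through the real OS. A rough sketch, with a hypothetical activity standing in for a real screen:

```kotlin
import androidx.test.ext.junit.rules.ActivityScenarioRule
import com.dropbox.dropshots.Dropshots
import org.junit.Rule
import org.junit.Test

class HomeScreenshotTest {
    // MainActivity is a hypothetical stand-in for a real screen.
    @get:Rule val activityRule = ActivityScenarioRule(MainActivity::class.java)
    @get:Rule val dropshots = Dropshots()

    @Test
    fun homeScreen() {
        activityRule.scenario.onActivity { activity ->
            // Captures the activity on the device or emulator and compares
            // it against a reference image bundled with the test.
            dropshots.assertSnapshot(activity, "home_screen")
        }
    }
}
```

Because the capture happens on an actual device, behaviors like edge-to-edge insets, system bars, and OS-version rendering differences show up in the snapshot, which JVM-based tools can’t fully reproduce.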
Manual testing
With all of the investment we’ve made in automated testing you’d be forgiven for thinking that we do no manual testing, but even today eliminating it entirely simply isn’t feasible. There are many workflows for which automated tests would be either too hard to write or too hard to validate. For example, we have both unit and E2E tests that validate the app behaves appropriately when rendering file content, but it can be hard to programmatically validate that content, and screenshot tests can sometimes prove too flaky.
For these cases, we use a web-based test case management tool to maintain a complete set of manual test cases, and a third-party testing service to execute the tests prior to each release. This allows us to catch issues for which we haven’t yet written tests, or which require human judgment.
Looking forward
Testing has proven invaluable in identifying quality issues before they make it to users, allowing us to earn our customers’ trust. Given that value, we intend to continue investing in testing to ensure that we can maintain high quality and reliability. There are a few things that we’re looking forward to in the future.
I’m currently in the process of expanding the functionality of Dropshots to support multiple device configurations, which will allow us to perform screenshot tests across a broad range of devices with a single set of tests. Since the Dropbox app works across many different form factors, it will be valuable for us to simultaneously run our screenshot test suite on a variety of devices or emulators to prevent regressions on less common form factors.
Additionally, we’re beginning to experiment with Compose Preview Screenshot Testing, which allows our Compose Preview functions to serve double duty by speeding up development cycles while also being used to protect against regressions.
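With that tooling, an ordinary preview function placed in the dedicated screenshot test source set doubles as a screenshot test. A sketch, following the tool’s documented setup at the time of writing, with a hypothetical component:

```kotlin
// In the screenshotTest source set: the same @Preview that drives the
// IDE preview is rendered by the screenshot testing tasks and compared
// against a stored reference image.
import androidx.compose.runtime.Composable
import androidx.compose.ui.tooling.preview.Preview

class BadgePreviewScreenshots {
    @Preview(showBackground = true)
    @Composable
    fun BadgePreview() {
        // Badge is a hypothetical design-system composable.
        Badge(text = "New")
    }
}
```

Reference images are then generated and checked with the tool’s Gradle update and validate tasks, so the previews engineers already maintain become regression coverage for free.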
Finally, we intend to continue ensuring that we have a good balance of the right kinds of tests, shaping our testing pyramid so that our investment in testing serves our reliability goals instead of chasing arbitrary coverage targets. We’ve already seen the value that a healthy test suite can provide, and we’ll continue investing in this area to ensure that we remain worthy of our users’ trust.
How Dropbox leverages testing to maintain a high level of trust at scale was originally published in Android Developers on Medium, where people are continuing the conversation by highlighting and responding to this story.
