Building Mobile Apps At Scale

Abílio Azevedo

November 13, 2023


When a problem is simple, a simple solution suffices. When problems become complex, scalable solutions are needed. There is no silver bullet, so technical choices involve trade-offs that need to be negotiated.

PART 1: Challenges Due to the Nature of Mobile Apps

Mobile development has several peculiarities and unique challenges. Below are some of them:

1. State Management

The network can fail and cause crashes and corrupted states
State caching

2. Mistakes are hard to revert

App store version updates can be slow

3. The long tail of old app versions

Multiple versions can run in parallel
Old versions can break when the backend changes

4. Deep links

Can break in old versions of the app
Corrupted state issues
Deep link implementations differ for iOS (universal links and URL schemes) and Android (intent-based).
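
One way to keep deep links from corrupting state is to route anything unrecognized to a safe screen. The sketch below is only an illustration: the `myapp` scheme, the route names and the `resolveDeepLink` function are all assumptions made for the example, not part of any platform API.

```typescript
// Hypothetical route table: the screens and the "myapp" scheme are
// illustrative assumptions, not taken from a real app.
type Route =
  | { screen: "home" }
  | { screen: "order"; orderId: string };

function resolveDeepLink(link: string): Route {
  // Expected shape: myapp://<host>/<path segments>
  const match = /^myapp:\/\/([^/?#]+)((?:\/[^/?#]+)*)/.exec(link);
  if (match === null) {
    return { screen: "home" };
  }
  const host = match[1] ?? "";
  const segments = (match[2] ?? "").split("/").filter((s) => s.length > 0);
  const orderId = segments[0];
  if (host === "order" && segments.length === 1 && orderId !== undefined) {
    return { screen: "order", orderId: orderId };
  }
  // Unknown or malformed links fall back to a safe screen instead of
  // crashing or leaving the app in a half-initialized state.
  return { screen: "home" };
}
```

An old app version that receives a link added in a newer release simply lands on the home screen.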

5. Push notifications and background

Delivery is not guaranteed (poor signal, muted notifications...)
Setting up and operating push notifications is complex. On both Android and iOS, the app needs to obtain a device token from the platform's push service (FCM on Android, APNs on iOS) and then store that token in the backend. There are many steps to follow to get push notifications working.

6. App crashes

The first rule for crashes is that you need to monitor when they happen and have enough debugging information.

7. Offline support

Safely detecting when the phone is offline
Detecting connection speed and latency
Persisting local state when device is offline and syncing back when connection is regained.
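
The "persist locally, sync back later" idea can be sketched as a queue of pending operations. This is a minimal in-memory sketch: `OfflineQueue`, `PendingOp` and the `send` callback are hypothetical names, and a real app would also persist the queue to disk and handle conflicts.

```typescript
// A write recorded while the device is offline.
interface PendingOp {
  id: string;
  payload: string;
}

class OfflineQueue {
  private pending: PendingOp[] = [];

  enqueue(op: PendingOp): void {
    this.pending.push(op);
  }

  // Sends queued operations in order. `send` is a hypothetical transport
  // callback that returns true on success. Flushing stops at the first
  // failure so ordering is preserved and the rest is retried later.
  flush(send: (op: PendingOp) => boolean): number {
    let sent = 0;
    while (this.pending.length > 0) {
      const next = this.pending[0];
      if (next === undefined || !send(next)) {
        break;
      }
      this.pending.shift();
      sent++;
    }
    return sent;
  }

  get size(): number {
    return this.pending.length;
  }
}
```

Calling `flush` again once the connection is back picks up where the previous attempt stopped.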

8. Accessibility

Accessibility is crucial in popular apps as many users have accessibility needs and there is legal risk if the app is not accessible. Also, accessible apps have better quality.
The level of depth in implementing WCAG 2.1 mobile guidelines should be confirmed upfront.
Support for screen readers (VoiceOver on iOS, TalkBack on Android)
Color/element contrast
Allowing preferences like font size is important.
Device fragmentation also needs consideration.

Implementing accessibility from the start takes surprisingly little effort on iOS and Android. Retrofitting is more laborious, so it is better to include accessibility in the design and planning process.

Accessibility testing needs planning. Good practices: automate what's possible, test manually on a regular basis, recruit users with accessibility needs for feedback, and enable accessibility features during development.

9. CI/CD and the build train

In large companies, having a dedicated infra team to handle CI/CD is important.
For smaller companies, using a vendor like Bitrise is usually the more sensible choice, given the work involved in running your own setup.
Keeping the master branch green at scale

10. Third party libraries and SDKs

Security remains your responsibility
Put the SDK behind a feature flag so it can be disabled
App size impact
Check that it is well maintained (updates, support, etc.)
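
Putting an SDK behind a flag usually means wrapping it. The sketch below assumes a made-up `AnalyticsSdk` interface and an `isEnabled` callback that could be backed by a remote flag; none of these names come from a real vendor.

```typescript
// Hypothetical shape of a third-party analytics SDK.
interface AnalyticsSdk {
  track(event: string): void;
}

class GuardedAnalytics {
  private readonly sdk: AnalyticsSdk;
  private readonly isEnabled: () => boolean;

  constructor(sdk: AnalyticsSdk, isEnabled: () => boolean) {
    this.sdk = sdk;
    this.isEnabled = isEnabled; // e.g. backed by a remote feature flag
  }

  // Returns true only if the event actually reached the SDK.
  track(event: string): boolean {
    if (!this.isEnabled()) {
      return false; // kill switch is on: skip the SDK entirely
    }
    try {
      this.sdk.track(event);
      return true;
    } catch {
      return false; // a failure inside the SDK must not crash the app
    }
  }
}
```

The rest of the app only talks to the wrapper, which also makes it easier to swap vendors later.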

11. Device and OS fragmentation

OS difference (APIs and capabilities) as well as hardware differences (screens, processor, memory)

12. In-app purchases

RevenueCat makes creating and managing in-app purchases on iOS and Android at scale easier.
Many in-app purchase (IAP) states and edge cases to handle: cancellation, repurchase, upgrade, ratings, discounts, credit card issues...

PART 2: Challenges due to app complexity

13. Navigation architecture within large apps

Having a well-defined app navigation strategy with good separation of app state is fundamental for any decent-sized app.

14. App state and event-driven changes

Follow recommended state management practices to keep the number of bugs caused by state changes low.
Keep state as immutable as possible and store models as immutable objects that emit state changes.
Log invalid states with enough information to debug and reproduce what went wrong.
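
A small reducer shows both ideas at once: state is never mutated, and invalid transitions are recorded instead of being silently applied. The states, events and transition table below are assumptions for the sake of the example.

```typescript
type OrderState = "idle" | "loading" | "loaded" | "failed";
type OrderEvent = "fetch" | "success" | "error" | "retry";

interface AppState {
  readonly order: OrderState;
  // Stands in for a real logger: keeps "<state>:<event>" for each bad event.
  readonly invalidTransitions: readonly string[];
}

// Hypothetical transition table: which events are legal in which state.
const transitions: Record<OrderState, Partial<Record<OrderEvent, OrderState>>> = {
  idle: { fetch: "loading" },
  loading: { success: "loaded", error: "failed" },
  loaded: { fetch: "loading" },
  failed: { retry: "loading" },
};

function reduce(state: AppState, event: OrderEvent): AppState {
  const next = transitions[state.order][event];
  if (next === undefined) {
    // Keep the previous state and record enough context to reproduce.
    return {
      ...state,
      invalidTransitions: [...state.invalidTransitions, `${state.order}:${event}`],
    };
  }
  // Always return a new object; the old state is left untouched.
  return { ...state, order: next };
}
```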

15. Internationalization

Both iOS and Android offer opinionated ways to implement localization. iOS offers localization export for translation, while Android is based on string resources. The tooling differs a bit, but the concept is similar.

The standard approach is to localize the strings and ship them as separate resources in the binary. With large apps and many locales, you quickly face challenges with this workflow. Alternatively, you can load translations at runtime, but that makes updating strings on the native side difficult.
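
Whichever workflow you pick, string lookup usually needs a fallback chain so a missing translation never breaks the UI. A minimal sketch, assuming a made-up in-memory `catalog` and English as the default language:

```typescript
type Catalog = Record<string, Record<string, string>>;

// Hypothetical string catalog; a real app would load this from resources.
const catalog: Catalog = {
  en: { greeting: "Hello", checkout: "Checkout" },
  "pt-BR": { greeting: "Olá" },
};

function translate(locale: string, key: string, strings: Catalog = catalog): string {
  const language = locale.split("-")[0] ?? locale;
  // Try the exact locale, then the bare language, then the default.
  for (const candidate of [locale, language, "en"]) {
    const table = strings[candidate];
    const value = table === undefined ? undefined : table[key];
    if (value !== undefined) {
      return value;
    }
  }
  return key; // last resort: show the key rather than crash
}
```

Here `translate("pt-BR", "checkout")` falls back to the English string because the Portuguese catalog has no entry for it.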

16. Modular architecture and dependency injection

Dependency injection is a powerful tool to keep code consistently testable across the codebase.
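
As a small illustration of why injection helps testability, the sketch below passes a clock into a class instead of letting it read the system time. `Clock`, `SessionManager` and `FakeClock` are hypothetical names for the example.

```typescript
interface Clock {
  now(): number; // milliseconds
}

class SessionManager {
  private readonly clock: Clock;
  private readonly startedAt: number;

  // The dependency is passed in rather than created inside the class.
  constructor(clock: Clock) {
    this.clock = clock;
    this.startedAt = clock.now();
  }

  isExpired(timeoutMs: number): boolean {
    return this.clock.now() - this.startedAt >= timeoutMs;
  }
}

// Test double: lets a test move time forward without waiting.
class FakeClock implements Clock {
  private current = 0;

  now(): number {
    return this.current;
  }

  advance(ms: number): void {
    this.current += ms;
  }
}
```

In production the same `SessionManager` would receive a clock backed by the real system time.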

Conway's law says that the structure of a software system reflects the structure of the organization that developed it. This means if the company creating the software is divided into multiple teams or departments, the software will likely also be divided into corresponding parts for those teams.

For example, imagine the company has a design team, a programming team, and a testing team. Then the software they create together will likely have a design part, a programming part, and a testing part that correspond to those teams.

It's like if you and your friends decided to build a treehouse. If each friend makes a different part of the house, like the ladder, roof, and walls, in the end the different parts will fit together like a complete house, but you'll still be able to see which parts were made by each friend.

So that's the basic idea of Conway's law: the structure of a software system ends up reflecting the structure of the organization that created it.

17. Automated testing

If you are not doing a decent level of automated testing in a large app, then you are digging a hole for yourself.
Unit test: the simplest of all automated tests, testing an isolated component, also called a “unit”.
Integration tests are a step up in complexity from unit tests.
Snapshot test: compares the layout of a UI element or page with a reference image.
UI test: a test that exercises the UI and tests if the UI behaves in a specific way. https://go.mobileatscale.com/unit-testing-benefits

18. Manual testing

Manual testing is viable initially but doesn't scale. Automate most tests, but keep exploratory manual testing and manual checks for cases that are hard to automate, like payments and camera. Incorporate manual testing into the build/release flow.

PART 3: Challenges due to large engineering teams

19. Planning and Decision Making

Formalizing the planning process

RFC (Request for Comments) planning process (https://blog.pragmaticengineer.com/scaling-engineering-teams-via-writing-things-down-rfcs/)
PRD (product requirements document)
iOS and Android teams working together on planning
Paralysis by analysis: teams can get stuck in planning. Limit planning time and choose the most sensible approach. Prototyping and getting feedback can be more important than overplanning.
Balancing signal and noise: teams fear creating "noise" with new plans. However, working in silos has a cost too. Design docs and RFCs help spread knowledge quickly. Start by defining templates. Get engineer buy-in. Publish docs for all and iterate based on feedback.

20. Architecting ways to avoid stepping on each other's toes

Architecture doesn't matter much initially. The problem starts when many engineers modify the same files. To scale to hundreds of engineers:

isolate features
use monorepo structure
define strong ownership
have automated testing

Document the approach and consider tools to enforce it.

21. Shared architecture between multiple apps

Teams initially build apps with different architectures. The idea of unifying the architecture will emerge over time. Complete rewrites are expensive and painful. A sensible middle ground is making incremental changes towards a unified architecture, without rewriting everything. Benefits include shared language, joint planning, breaking down silos, and shared components. Introduce new concepts in new parts of the app. Complete rewrites rarely make sense, unless major app changes are needed.

22. Tooling maturity for large mobile engineering teams

In apps with millions of lines of code and dozens of engineers, native tooling starts to have performance and workflow issues. Build time is one of the biggest bottlenecks. Optimizing the build and workflows is worth it at large scale. Compared to backend, tooling for large apps is still a less solved problem. Consider building your own tools beyond what's readily available. This helps with build time, release workflows and more.

23. Scaling build and merge times

Build time for native apps can become problematic as projects grow. iOS engineers are familiar with Xcode's slow builds, and the same happens on Android. Apple and Google haven't prioritized improvements in this area for large projects.

Thankfully tools exist to speed up build time, like Bazel, adopted by companies like Uber, Pinterest and Grab. Still, it's work to integrate and configure these tools. The more mobile engineers, the more it makes sense to invest in improving build experience.

At some point, teams consider migrating from distributed repos to monorepo to avoid constant re-downloading. Good monorepo tools for mobile don't exist yet, so customization is required.

Keeping master branch always working also becomes challenging. If builds are fast and few merges happen per day, no problem. But with slow builds and many PRs per hour, it's complex to keep master stable and merge time low. At Uber, we built a Submit Queue to handle this, splitting builds in parallel and prioritizing changes more likely to pass.

24. Mobile platform libraries and teams

As the number of mobile engineers grows, the tendency of "reinventing the wheel" emerges, with different teams creating their own implementations of functionalities like logging and data storage.

Internal mobile libraries are created to share these functionalities, but maintenance becomes difficult over time. The original engineers may leave, quality decreases with quick fixes and the owning team lacks bandwidth for bigger changes.

Mobile platform teams are a common solution to manage shared libraries. They take on areas like build, release, tooling, architecture, performance, reliability and SDKs. There's no rule for when to create this team, but 20-30 mobile engineers is common.

Starting the platform team too early can be challenging to justify, taking experienced engineers away from product teams. Too late results in redundancy and little reuse. Balancing these tradeoffs is key.

PART 4: Multiplatform languages and approaches

25. Adopting new languages and frameworks

Adopting a new language or framework is risky, especially in complex apps. Uber's move to Swift was almost a disaster due to the drastic increase in binary size. After that, Uber started evaluating new tech more carefully.

Areas to assess include language/framework maturity, migration needs, engineer enthusiasm and risks, ideally validated with a pilot project. My advice is to be receptive to new technologies while limiting the "blast radius": try them out in less critical parts of the app or with fewer users first.

Mobile evolution is constant, and stories like the Swift migration serve as a warning to have a "plan B" when adopting new tech.

26. Kotlin Multiplatform and KMM

Kotlin Multiplatform and Kotlin Multiplatform Mobile (KMM) are prominent multiplatform development approaches, allowing efficient code sharing in Kotlin.

Kotlin Multiplatform, introduced in 2017, supports creating JVM libraries, native frameworks and JavaScript artifacts.

KMM, launched in 2020, provides tools that simplify iOS and Android development, including rich IDE integration and Cocoapods support.

Although companies like Netflix and Square have found success, there are challenges related to the relative immaturity of the technology in 2021, with experimental tooling and possible breaking changes impacting early adopters.

However, I believe in the potential of Kotlin Multiplatform and KMM, especially for native Android engineers, for whom the transition is smooth, while the learning curve remains accessible for iOS developers.

27. Developing multiplatform features

The idea behind multiplatform features and business logic: write the platform-independent logic once and reuse it across apps.

28. Multiplatform app development versus native

Motivations to consider for a multiplatform app development approach:

The “need for speed”. The frustration with how long it can take to build a feature on both platforms. Wouldn’t it be great to work with an approach that promises faster feature development time, and it happens on both platforms?
The desire to have one engineer work across both platforms, rather than needing two engineers. To make even the simplest UI changes in the mobile app, two engineers need to make two separate changes on two platforms. Both need to be tested and the implementations often need coordination. Wouldn’t it be nice if the same person could do both?
The desire for the iOS and Android apps to function exactly the same. The iOS and Android apps for the same product generally differ in some aspects. A bug will be present on Android but not on iOS. Wouldn’t it be great if both apps functioned identically?
Unifying the look of the iOS and Android apps. Over time, the design team will advocate for a shared UI/UX approach. This makes a lot of sense both for branding and for reducing design work from two platforms to one.
The desire to “hot reload”, rather than waiting for train builds. Even the smallest change in the mobile app takes weeks to ship due to the code changes that need to be pushed through the App Store. Wouldn’t it be wonderful to have the option to try out or push bug fixes without having to wait for the build process?

29. Web, PWA, and Backend-Driven Mobile Apps

Not being able to update native apps in real-time is the source of much pain, as seen in the chapters on reverting mistakes and app versions. Thankfully, there are some "magic wands" for instant updates:

PWAs (Progressive Web Apps) leverage web APIs for a native-like experience. They are useful for enhancing the mobile web but don't replace complex native apps.
Embedded webviews allow dynamic content, but have challenges like performance and non-native UX. They require significant optimization effort.
Backend-driven apps send executable logic or metadata that controls the native app. This gets around the limitation of not being able to easily update binaries. However, it carries risks, such as Apple prohibiting executable code, and it is challenging to version and test.
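
To make the metadata flavor of backend-driven apps more concrete, here is a minimal sketch where the server describes a screen and the client maps each node to something it knows how to draw. The payload shape and component names are assumptions; rendering to text lines stands in for building native views.

```typescript
// Hypothetical server payload describing a screen.
interface ServerComponent {
  type: string;
  text?: string;
  children?: ServerComponent[];
}

function render(node: ServerComponent): string[] {
  switch (node.type) {
    case "title":
      return ["# " + (node.text ?? "")];
    case "button":
      return ["[" + (node.text ?? "") + "]"];
    case "stack": {
      let lines: string[] = [];
      for (const child of node.children ?? []) {
        lines = lines.concat(render(child));
      }
      return lines;
    }
    default:
      // Component types this app version does not know are skipped, so a
      // newer backend does not crash an older client.
      return [];
  }
}
```

The `default` branch is the part that matters for versioning: the backend can ship new component types while old app versions keep working with what they understand.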

Overall, each approach has pros and cons. PWAs complement websites. Embedded webviews enable dynamic content. Backend driving the app gets around limited updates but adds complexity. Balancing tradeoffs is important.

30. Experimentation

Companies with widely used apps A/B test even small changes to measure impact and avoid regressions. This goes beyond feature flags and involves controlled rollout, analysis, problem detection and post-analysis.
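
A core building block of controlled rollout is assigning each user to a variant deterministically, so the same user always sees the same experience. The sketch below uses a simple string hash chosen for illustration; it is not the scheme of any particular vendor, and `bucketOf` and `variantFor` are hypothetical names.

```typescript
// Maps a user to a stable bucket in [0, buckets) for a given experiment.
function bucketOf(userId: string, experiment: string, buckets: number): number {
  // Including the experiment name keeps assignments independent across tests.
  const key = `${experiment}:${userId}`;
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    // Simple 32-bit rolling hash; ">>> 0" keeps the value unsigned.
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % buckets;
}

function variantFor(
  userId: string,
  experiment: string,
  treatmentPercent: number,
): "control" | "treatment" {
  return bucketOf(userId, experiment, 100) < treatmentPercent ? "treatment" : "control";
}
```

Raising `treatmentPercent` from 10 to 50 only moves users from control to treatment, never the other way, which keeps a gradual rollout consistent.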

At small scale it's easy, but at large scale, like Uber with 1000+ concurrent tests, it's complex. Tooling helps, but big companies commonly build in-house systems to serve specific use cases, to control the data, and because experimentation is core to the business.

Motivations for vendor solutions are cost (cheaper than building and maintaining your own) and standardization (versus different teams with customized solutions).

Smaller/medium companies typically use off-the-shelf solutions like Firebase Remote Config, LaunchDarkly and Optimizely. Large companies build their own.

Another challenge is having processes to track tests and keep them from impacting each other. This requires customized tooling and processes. Testing everything is common at large scale to avoid significant regressions. At Uber, even hotfixes were tested to make sure business metrics were not hurt.

In summary, A/B testing is powerful but challenging at scale. Small teams use off-the-shelf solutions, large teams build their own. Good processes are also needed beyond just tooling.

PART 5: Challenges due to stepping up your game

31. The Feature Flag Hell

Feature flags are great for A/B testing, gradual and segmented rollouts. However, problems arise as the number of active flags grows:

Flag dependencies can complicate rollouts. Mapping dependencies helps.
Conflicting flags between teams break parts of the app when activated together.
Obsolete flags remain in the code as dead code, which is worse than normal dead code since it's hard to confirm it is really dead. Automating cleanup helps.
Inconsistent code around flags makes automated cleanup difficult. A consistent pattern and a linter help.

Vendor solutions are mature, and building a custom system is a major effort that is rarely justified without a very strong business case. Building a facade on top of a vendor solution can be a smart alternative.
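
A facade can be as small as one class that owns the flag names and their defaults. In the sketch below, `FlagVendor` is a stand-in for whatever vendor client you use (its single method is an assumption, not a real SDK signature), and the flag names are made up.

```typescript
// Stand-in for a vendor client; the method shape is an assumption.
interface FlagVendor {
  getBoolean(key: string): boolean | undefined;
}

// Single place where every flag and its safe default is declared.
const FLAG_DEFAULTS = {
  newCheckout: false,
  darkMode: true,
};

type FlagName = keyof typeof FLAG_DEFAULTS;

class FeatureFlags {
  private readonly vendor: FlagVendor;

  constructor(vendor: FlagVendor) {
    this.vendor = vendor;
  }

  // Call sites can only ask for declared flags, which keeps usage
  // consistent and makes stale flags easy to find and remove.
  isEnabled(name: FlagName): boolean {
    const value = this.vendor.getBoolean(name);
    return value === undefined ? FLAG_DEFAULTS[name] : value;
  }
}
```

Because flags are typed, deleting an entry from `FLAG_DEFAULTS` turns every leftover usage into a compile error, which helps with the cleanup problem described above.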

In summary, flags are valuable but accrue tech debt. Controlling this requires discipline, patterns and ideally automated cleanup.

32. Performance

Bloated app startup time
Many parallel network calls
Network performance: Uber built its own network protocol
Battery consumption rate
App not responding (ANR)
Frozen frames and slow render frames
Animation and UI render performance

33. Analytics, monitoring and alerts

Monitoring and alerts for problems are common practices in backend and web, but less so in mobile. Crash reports are the most common mobile alert. However, business event monitoring and alerts are rare.

Steps for good monitoring:

Define critical business events like payment flows or signups
Map these events in the mobile platforms
Implement real-time monitoring of those events
Validate metrics with clear specs and automated tests
Cross-check data by comparing sources and doing sanity checks

For alerts:

Crash alerts are more common but limited
Alerts on key metrics during rollouts help detect regressions
Few teams do business event alerts despite the value
Noise is an issue: thresholds and regional data are complex to get right
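
A business event alert often boils down to comparing a success rate against a baseline while ignoring windows with too little data. The function and parameter names below are hypothetical, and the thresholds are placeholders you would tune per metric and region.

```typescript
interface WindowStats {
  attempts: number;
  successes: number;
}

// Returns true when the success rate in the current window dropped more
// than `maxDrop` below the baseline rate.
function shouldAlert(
  current: WindowStats,
  baselineRate: number,
  maxDrop: number,
  minAttempts: number,
): boolean {
  if (current.attempts < minAttempts) {
    return false; // too little data: skip to avoid noisy alerts
  }
  const rate = current.successes / current.attempts;
  return baselineRate - rate > maxDrop;
}
```

For example, with a 95% baseline and a tolerated drop of 5 points, a window with 800 successful payments out of 1000 attempts would trigger an alert, while a window with only 10 attempts would be ignored.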

In summary, mobile business event monitoring and alerting is powerful but challenging. It requires work in defining, mapping and validating data, plus dealing with noise and dimensionality.

34. Mobile On-Call

Having a mobile on-call rotation becomes necessary once you have mobile alerts, even if just crashes. Large teams can have a mobile-only rotation. Small teams end up with a few people on call.

Having the rotation is just the first step. Training in incident response is crucial for fast mitigation and minimal impact. This involves prioritization, escalation, dividing tasks and lessons learned.

Teams with few mobile engineers end up mixing it with backend on-call. This only works if there are simple and clear runbooks for most alerts. Without them, engineers won't know how to respond.

In summary, having a solid mobile on-call requires enough people, training in incident response, comprehensive runbooks, and investment in learning from each incident.

35. Advanced code quality checks

Getting fast feedback on code issues increases engineer and team productivity. Advanced code checks provide this even before code review.

Code formatting via linting ensures code follows the style guide. More advanced lint rules enforce architecture and coding patterns.

Static analysis automatically inspects code for more complex issues like unused variables and potential null dereferences. Popular tools include SwiftLint, ktlint, SonarQube, Clang analyzer and Infer. Large teams build their own.

Code coverage also helps by showing how much code is tested. Integrated into the workflow, it can enforce a minimum coverage policy.

Pros: Fast feedback, higher quality and stability.
Cons: Integration and maintenance time.

Balancing value vs effort for tools is key.

36. Compliance, Privacy and Security

Apps and development processes often need to follow regulations and privacy guidelines. The most common are:

PII (Personally Identifiable Information) cannot be accessed by anyone unauthorized.
EU GDPR expands the scope of PII and its use.
Industry specific guidelines like PCI DSS for payments or HIPAA for healthcare.

Areas of impact in mobile engineering:

Logging of sensitive data needs anonymizing and encrypting.
Auditing parts of the app like third-party SDKs for GDPR and PII compliance.
Training engineers on privacy law implications in code.
Security checks in CI/CD help, as does specific training on risks.
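
For the logging point, a common first step is to mask known sensitive fields before a log line leaves the device. This is a minimal sketch: the key list is a made-up example, and masking alone does not amount to anonymization or encryption.

```typescript
// Hypothetical list of field names that must never be logged in clear text.
const SENSITIVE_KEYS = ["email", "phone", "cardNumber"];

function redact(fields: Record<string, string>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const key of Object.keys(fields)) {
    const value = fields[key] ?? "";
    safe[key] = SENSITIVE_KEYS.indexOf(key) >= 0 ? "[REDACTED]" : value;
  }
  return safe; // the input object is left untouched
}
```

Routing every log call through one function like this also gives auditors a single place to review.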

In summary, compliance and privacy are crucial and require extensive work in processes, code, and auditing. The sooner they are reviewed, the better.

37. Client-side data migration

Mobile apps storing data locally face unique schema migration challenges when updating. Migrating large local datasets is complex. Difficulties include testing migration with real data, testing upgrades from old versions, client-side logging, and handling failures. Bugs in the new schema can also break the app for users. Ideally, the backend should be the source of truth in schema changes to avoid fragile migrations on device.
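
When on-device migrations can't be avoided, running them as small versioned steps makes upgrades from any old version testable. The store shape, field names and migration steps below are assumptions for illustration only.

```typescript
interface Store {
  version: number;
  data: Record<string, string>;
}

type Migration = (data: Record<string, string>) => Record<string, string>;

// migrations[n] upgrades schema version n to n + 1 (hypothetical steps).
const migrations: Migration[] = [
  // v0 -> v1: rename "fullName" to "name"
  (d) => {
    const { fullName, ...rest } = d;
    return { ...rest, name: fullName ?? "" };
  },
  // v1 -> v2: add a default "locale"
  (d) => ({ ...d, locale: d["locale"] ?? "en" }),
];

function migrate(store: Store, target: number = migrations.length): Store {
  let current = store;
  while (current.version < target) {
    const step = migrations[current.version];
    if (step === undefined) {
      throw new Error(`No migration from v${current.version}`);
    }
    // Each step builds a new object, so a failure leaves the input intact.
    current = { version: current.version + 1, data: step(current.data) };
  }
  return current;
}
```

A user jumping from a very old app version runs every step in order, which is exactly the upgrade path that tends to go untested.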

38. Forced update

Most mature apps today implement forced update mechanisms. Motivations include retiring old APIs, reducing testing and support costs, removing severe bugs, and fixing vulnerabilities. Implementations exist in Snapchat, Facebook Messenger, banks, JustEat and others.

The challenge is building the solution well before needing to use it. Testing is also crucial to ensure it will work. Google provides a native in-app update mechanism on Android 5.0+. On iOS, custom solutions are required.
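
The decision logic behind a custom forced update is usually a version comparison against values served by the backend. A minimal sketch, assuming dotted numeric versions and hypothetical `minSupported` and `latest` values coming from your own config endpoint:

```typescript
function parseVersion(version: string): number[] {
  // Non-numeric parts are treated as 0 in this simplified sketch.
  return version.split(".").map((part) => parseInt(part, 10) || 0);
}

// Returns -1, 0 or 1, comparing part by part ("1.10.0" > "1.9.9").
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  const length = Math.max(pa.length, pb.length);
  for (let i = 0; i < length; i++) {
    const left = pa[i] ?? 0; // missing parts count as 0
    const right = pb[i] ?? 0;
    if (left !== right) {
      return left < right ? -1 : 1;
    }
  }
  return 0;
}

type UpdatePolicy = "none" | "recommended" | "forced";

function updatePolicy(installed: string, minSupported: string, latest: string): UpdatePolicy {
  if (compareVersions(installed, minSupported) < 0) {
    return "forced";
  }
  if (compareVersions(installed, latest) < 0) {
    return "recommended";
  }
  return "none";
}
```

Comparing numerically per part matters: a plain string comparison would wrongly rank "1.10.0" below "1.9.9".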

Supporting old phone models is complicated. The strategy depends on the business use case. Some apps support old versions for years, others set a cutoff. Forced update isn't just a tool, but a strategy. Covering edge cases is key.

39. App Size

App size matters because it impacts downloads and uninstalls. Smaller apps get more downloads. On Android, App Bundles can reduce size by about 35%. On iOS, the limit for over-the-air (cellular) downloads is 200MB. Uber saw a huge dropoff when passing this limit.

Small "lite" apps are a common strategy in emerging markets. Uber, Facebook and Google use them. They require tradeoffs like more engineering effort, reduced functionality, and network and asset optimizations.

In large apps, static assets bloat size if unchecked. Mobile platform teams tend to take on monitoring and reducing size. Installed size is also important but rarely monitored.

Even if not user visible, large size impacts metrics. It is worth investing in optimization if the business impact is significant; otherwise it is not worth the effort.
