Dec. 13, 2024

Tackling the Monolith/Microservices Dilemma at Instawork

published by Adam Stepinski
in blog Instawork Engineering

First, There Was The Monolith

Instawork’s main product is implemented as a monolithic codebase using the Django framework. This choice has served us well since the founding of the company, and it exemplifies our guiding principle of “Respecting the Craft”:

Develop deep expertise with a limited set of tools, and pick technologies carefully to solve a particular problem

By sticking with a monolith, engineers at Instawork need to develop expertise with only one language (Python) and one set of conventions (Django). With that set of tools, our team is able to work on any feature across the entire product. This flexibility has helped us stay nimble and pivot engineering effort to the most important initiatives at the company. Additionally, a monolith greatly simplifies our CI/deployment process and production monitoring setup. This means more engineering effort can go to developing the product, rather than to maintaining a complex infrastructure (such as Kubernetes).

Every engineering decision comes with tradeoffs. Common criticisms of monoliths include:

  • scalability issues
  • long CI/deployment cycles
  • lack of modularity (impacting development time)

These problems become more acute as the size of the monolith grows. Indeed, over the last year we started noticing early signs of issues in our Django codebase: spaghetti code and tight coupling between modules. We decided to proactively address the problem before it crippled our development velocity.

Microservices, or not?

grug wonder why big brain take hardest problem, factoring system correctly, and introduce network call too

Microservices are commonly seen as a solution to the problems of a growing monolith. But I’d argue it’s rarely the right choice to jump from a monolith straight to microservices. It’s true that each individual microservice will be easier to scale, test, and deploy than the monolith it came from. However, that comes at the price of more complex orchestration across all services. Individual developers pay this price when figuring out which set of services they need to run locally to test their feature end-to-end. And the company feels the pain when it needs to hire dedicated engineers to maintain and monitor a complex infrastructure in production.

Most importantly, moving to microservices doesn’t necessarily make the overall system more modular. It’s still possible to have tight coupling between services: a network boundary between two pieces of code doesn’t automatically make them loosely coupled. It’s all too easy to end up with a distributed monolith: all of the pain of spaghetti code, with the added fun of network latency/errors and reasoning about a distributed system.

Starting to modularize

A snapshot of dependencies between modules in the Instawork monolith

Rather than jumping from a monolith straight into microservices, it makes more sense to refactor and modularize the monolith first. Even without a network boundary between modules, it’s possible to write loosely-coupled code that communicates over clean, well-defined interfaces that operate on simple data types. This approach is called the modular monolith, and I think it has the best attributes of microservices and monoliths:

  • CI and deployment remains simple
  • we don’t need extra monitoring for each service
  • no network latency or network errors introduced into the system
  • looser coupling between modules means easier development and faster testing

Additionally, a modular monolith is a great intermediate milestone for potentially breaking out modules into separate services. By first establishing a clean interface between modules, the process of pulling out the code and switching the interface to network calls becomes much easier. And the decision to pull out a service can be based on reasons other than improving modularity (such as resource utilization or team organization).
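To sketch why a clean interface makes a later extraction easier, consider a hypothetical module function (all names here are illustrative, not from the Instawork codebase). Callers depend only on the signature and the dataclass it returns, so the body can later switch from an in-process call to a network call without touching any caller:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProProfile:
    pro_id: int
    rating: float

def get_pro_profile(pro_id: int) -> ProProfile:
    # Today: an ordinary in-process lookup (a stand-in for an ORM query).
    return ProProfile(pro_id=pro_id, rating=4.8)

# If the module is later pulled out into a service, only the body changes,
# e.g. to an HTTP call; the signature and return type stay the same:
#
# def get_pro_profile(pro_id: int) -> ProProfile:
#     data = requests.get(f"https://pro-service.internal/pros/{pro_id}").json()
#     return ProProfile(pro_id=data["pro_id"], rating=data["rating"])
```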

We made the decision to evolve our monolith into a modular monolith. Unfortunately, the Django framework encourages tight coupling between database models, serializers, and views. While that coupling is great to build features quickly, it was now a hindrance to our modularization efforts. To encourage a new style of development, our platform team introduced Modularization Guidelines, an internal document of best practices to write code in the modular monolith style. The doc covered:

  • How to expose a clean service interface between modules
  • The use of Python dataclasses to send and return data through the interfaces
  • Advice on when to create a new module vs adding code to an existing module
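A minimal sketch of what a service interface following these guidelines might look like, assuming a hypothetical apps.pricing module (names and numbers are illustrative):

```python
from dataclasses import dataclass

# dataclasses.py: plain data types that cross the module boundary,
# so callers never receive Django ORM model instances.
@dataclass(frozen=True)
class ShiftQuote:
    shift_id: int
    hourly_rate_cents: int
    total_cents: int

# services.py: the module's public entry point.
def quote_shift(shift_id: int, hours: int) -> ShiftQuote:
    rate_cents = 2_500  # stand-in for a real pricing lookup
    return ShiftQuote(
        shift_id=shift_id,
        hourly_rate_cents=rate_cents,
        total_cents=rate_cents * hours,
    )
```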

We evangelized these guidelines throughout the team and tracked adoption. And indeed, new code was being written in a more modular fashion. But we soon realized that a set of guidelines was not enough.

  • It was still easy for an engineer to accidentally import code from an unrelated module, leading to tight coupling
  • Likewise, it was too easy to call deep into modules, rather than using the public interface
  • New code was easy to write in a modular fashion, but the bulk of the monolith was still tightly-coupled, and we didn’t know where to start pulling it apart.

Guidelines were not enough; we needed enforcement and insights too.

Going Faster with Tach

Enter Tach, an open-source tool to enforce modularity in large codebases. Tach is developed by Gauge, whose mission is to untangle tightly coupled monoliths. Getting started with Tach is easy: simply pip install tach, and run tach mod to mark the modules in your codebase. This automatically generates a config, which you can sync to your existing dependency state with tach sync.

[[modules]]
path = "apps.pro"
depends_on = [
    { path = "apps.booking" },
    { path = "apps.content" },
    { path = "apps.pricing" },
    { path = "apps.shift_requirements" },
    { path = "backend" },
]

Above is an example of a module in our monolith. This configuration in tach.toml means that adding an undeclared dependency to the apps.pro module will be flagged as an error.

Additionally, Tach lets us define interface rules that reflect our modularization guidelines:

[[interfaces]]
expose = [
    "services.*",
    "selectors.*",
    "constants.*",
    ".*dataclasses.*",
    ".*types.*",
]

This example shows which members other modules may import. Services and selectors refer to the public methods of a module, while constants, dataclasses, and types represent the public data types used in those methods. If an engineer tries to import anything else, e.g. a database model, into another module, Tach will automatically throw an error.
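As an illustration of the boundary such a rule draws, here is a single-file sketch of a hypothetical module: a public data type and a service function that other modules may import, and a private helper that they may not (all names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BookingSummary:  # a public data type: importable by other modules
    booking_id: int
    status: str

def _load_booking_row(booking_id: int) -> dict:
    # A private helper standing in for a Django model query. Under an
    # interface rule like the one above, importing a module's models
    # directly from another module would be flagged by `tach check`.
    return {"id": booking_id, "status": "filled"}

def get_booking_summary(booking_id: int) -> BookingSummary:
    # A service function: the only sanctioned way into the module.
    row = _load_booking_row(booking_id)
    return BookingSummary(booking_id=row["id"], status=row["status"])
```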

Some things we’ve appreciated about Tach so far:

  • It’s incredibly fast. We found some developer tools like mypy can take a minute to run on our codebase; Tach can analyze hundreds of thousands of lines in milliseconds.
  • Easy to configure. tach.toml is easy to read and extend, which means any engineer can expose their modules and specify their dependencies.
  • Easy to adopt. Gauge has been thoughtful about incremental enforcement. With a codebase as big as ours, we can’t adopt dependency and interface checks all at once.

In addition to using Tach locally, we’ve been using a new service from Gauge that integrates these checks across the entire team. Their platform gives us automatic CI checks in GitHub, a web UI that shows modularity violations with proposed fixes, and key insights via a modularity score tracked over time.

The team has continued to ship updates that more effectively surface the insights we need. As we dial in our enforcement configuration, we expect the platform to function as a guide toward low-hanging fruit for active refactoring. In particular, we anticipate integrating our codemod-based approach to refactors with the insights from Gauge in the future.

Conclusion

We’ve only been using Tach and Gauge’s web platform for a short time, but the added enforcement and insight have made a night-and-day difference in our modularization efforts:

  • Over half of our engineers have logged into the platform to investigate violations and get advice on how to adhere to our guidelines
  • Over 300 modularity errors have been resolved in the last 30 days (dependency and interface violations)

Now that the team has exposure to Tach, we’re making the CI status checks required. This will ensure that all code changes are improving our modularity, rather than making the problem worse. We expect to see significant improvements to developer velocity and platform reliability as we embrace the modular monolith. Stay tuned for updates on our progress!


Tackling the Monolith/Microservices Dilemma at Instawork was originally published in Instawork Engineering on Medium.