You need to be this tall to do micro-services


Hello, is this thing on?

Alright, now that we’ve established that this is a place where I publicly write about my experiences, I thought it’d be interesting to start with “all the steps you need to take before you can say you’re doing micro-services the right way”.

Some background

About a decade back, I was an entry-level software engineer. I was working alone on small projects while breaking into all the new technologies. Containers were not a hot topic back then. I was deploying on VMs, using supervisor and fabric, and provisioning infrastructure by hand (you know, clicking buttons on the AWS dashboard). I’m not even sure if micro-services were a concept at that point, but I have vague memories of SOA appearing in more and more tech articles.

Fast forward to 2020, and the tech stack I’m working on is almost entirely different. Golang, Python, everything containerized - usually Kubernetized. Quite often in a micro-services-based architecture, when the circumstances call for it.

Today’s state

Based on my professional experience so far, I’ve come to the understanding that there is a wide list of skills you need to master if you’d like to deploy a full-blown, sophisticated micro-services-based system. And this may be confusing for newcomers to this field, or entry-level software engineers who wonder how the heck tons of small components get wired up together and how backend engineers keep a holistic overview of them.

So I compiled a list of the most critical bits and pieces in the micro-services world.

Are you sure?

Before you start building your new shiny product or chop your monolith into micro-services, you need to weigh the strengths against the weaknesses to avoid some pitfalls. Micro-services come with:

  • scalability: smaller, often simpler apps, easily adjustable in terms of replication, sharding or load-balancing
  • cleaner codebase: a smaller codebase is easier to maintain in a clean and understandable state
  • independence: services are independent by all means, offering liberty in decisions, eg. deployments or database migrations
  • performance: clutter can be avoided, and independent services usually run on isolated hardware resources as well, eg. a dedicated database server

But it would only be fair if I gave you a heads-up of the downside as well:

  • overhead: the architecture becomes more complicated; you may resolve your monolith-derived issues, but you sign up for a ton of new ones and you simply need to do more and better planning.
  • data segregation: this is related to the performance item in the upsides list. As databases tend to be separated and resources isolated, the data living in them is separate too. Marshalling/unmarshalling data from other services is a pain as well, especially if interface{} is your favorite field data type.
  • operations complexity: yes, your operations will become 10x more complicated. You will need to provision and maintain new layers of abstraction that you did not have before. Both your Dev and Ops teams will need to adjust to this; it is not a plain and simple lift-and-shift process.

Are you still up for the challenge? Read on.

CI/CD

First of all, you need a working CI/CD infrastructure, functioning like a well-oiled machine. If you don’t have one, start investing ASAP in getting there. GitLab can help with that, and they have an interesting feature called Auto DevOps for those of you who don’t have a dedicated ops team to take care of it. That could be a very good starting point.

Each service can and will have its own CI pipeline. Adaptations may be necessary for some of them, but my suggested pipeline steps are:

  • build
  • test
  • code quality check
  • basic load testing
  • push to registry (for dockerized environments)
  • deploy
  • cleanup

MS Azure has published a fairly good article on this topic.

Also, an easy-to-deploy and much-beloved CI system is Drone. The reason I love it is that you can easily create plugins with no specific structure, as they are spun up inside short-lived containers. This is extremely important because, even if you already have infrastructure automation, you will need to do a ton more.

Monitoring and Observability

So many buzzwords here: Grafana, Prometheus, Loki, ELK, InfluxDB, Graphite, and many more.

The main purpose here is to:

  • have a time-series DB to keep your metrics data. Set a retention period of at least 20 days, so you can debug issues retrospectively instead of relying only on the last few hours of metrics.
  • have a dashboard to visualize those metrics. Create different screens to aggregate data and also have service-level visibility.
  • set up alerts to get notified when things go south. Shoutout to AlertManager for this.

That said, on the code side keep in mind that you should:

  • export service metrics that are relevant to the service you’re building (a minimal sketch follows right after this list).
  • set up healthchecks. Those are not only useful for staying on top of uptime metrics, but also for killing off erroneous app instances (containers/pods in the K8s world) and spinning up new ones.
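To make that concrete, here’s a minimal sketch in Go using the Prometheus client library: one exported counter plus a healthcheck endpoint. The metric name, port and paths are placeholders, not something your stack dictates.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ordersProcessed is a hypothetical service-level metric; export whatever
// is actually meaningful for the service you are building.
var ordersProcessed = promauto.NewCounter(prometheus.CounterOpts{
	Name: "orders_processed_total",
	Help: "Total number of orders processed.",
})

func main() {
	// /metrics is what Prometheus scrapes.
	http.Handle("/metrics", promhttp.Handler())

	// /healthz is what the orchestrator (e.g. a Kubernetes liveness probe)
	// hits to decide whether to kill this instance and spin up a new one.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Somewhere in your business logic:
	ordersProcessed.Inc()

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```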

Want something simple? Get started with a Loki/Grafana setup.

Logging

Distributed logging is a thing. Especially if we combine it with the Monitoring and Observability part, it’s a win-win situation. For example, Loki automatically grabs your container logs.
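The post doesn’t prescribe a logging library, but as an illustration, here’s a minimal sketch using logrus (my choice for the example, not a requirement) to emit structured JSON logs to stdout, which is what a collector such as Loki’s promtail or fluentd can then parse; the field names are just examples.

```go
package main

import (
	"os"

	"github.com/sirupsen/logrus"
)

func main() {
	log := logrus.New()
	// Write JSON to stdout: the container runtime captures it, and the log
	// collector can parse the fields instead of grepping raw text.
	log.SetOutput(os.Stdout)
	log.SetFormatter(&logrus.JSONFormatter{})

	log.WithFields(logrus.Fields{
		"service":    "orders",  // hypothetical service name
		"request_id": "abc-123", // lets you correlate lines across services
	}).Info("order created")
}
```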

You could go 3rd-party as well. SaaS offerings that are good in this area include Datadog, New Relic and Rollbar.

Tracing

Something went wrong. You need to understand what happened. In a monolithic environment, checking the logs would be sufficient for that.

In micro-services, there’s a chain of events behind every single outcome. Multiple different services were involved, infrastructure failed in the meantime, and a forgotten comma in service X’s code caused some failed responses. You need to be able to track the series of events that resulted in failure in order to troubleshoot transactions in a complex distributed system.

Jaeger is a good project for that, and it has graduated from the CNCF (I tend to be very favorable towards CNCF projects).
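As a rough sketch of what instrumenting a service can look like, here’s the OpenTracing API with the Jaeger Go client; the service name, span name and tag are placeholders, and newer setups may prefer OpenTelemetry instead.

```go
package main

import (
	"log"

	"github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
	// Sample everything for the sake of the example; production setups
	// usually use a probabilistic sampler instead.
	cfg := jaegercfg.Configuration{
		ServiceName: "orders", // hypothetical service name
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "const",
			Param: 1,
		},
		Reporter: &jaegercfg.ReporterConfig{LogSpans: true},
	}

	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatalf("could not initialize tracer: %v", err)
	}
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)

	// Wrap a unit of work in a span; child spans created in downstream
	// services get stitched into a single trace in the Jaeger UI.
	span := tracer.StartSpan("process-order")
	span.SetTag("order.id", "abc-123")
	span.Finish()
}
```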

Internal tooling

You will be surprised how much code will be shared between teams. Tools for logging and debugging. Service skeletons for setting up new database connections, managing application secrets, establishing that RabbitMQ connection on the 2nd try because the 1st one failed, or applying exponential backoff when rate-limited by an external API. And much more.
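As an illustration of the kind of helper that ends up in a shared internal package, here’s a minimal, stdlib-only retry with exponential backoff; the package name and usage are hypothetical.

```go
package retry

import (
	"fmt"
	"time"
)

// Do calls fn up to attempts times, doubling the wait between tries.
// The "connect to RabbitMQ on the 2nd try" and "back off when an external
// API rate-limits us" cases both reduce to a call to Do.
func Do(attempts int, initialDelay time.Duration, fn func() error) error {
	delay := initialDelay
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}
```

With something like that in place, the call site becomes a one-liner, e.g. retry.Do(5, time.Second, connectToRabbitMQ), where connectToRabbitMQ is whatever connection function your service already has.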

Get ready to invest a lot of time in designing and building internal tooling. Try to have your teams in sync. The core components can and many times should be shared.

Testing (unit, api, integration)

Many services mean less time for manual e2e testing. Automate things and set them to fail early if needed (that’s why you need a working CI/CD infrastructure).

  • Test your models and create unit tests
  • Test your functionality and mock stuff
  • Create integration tests to avoid huge surprises later on
  • Give your API endpoints a go (via tests) before releasing a service - a minimal sketch follows right after this list
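Here’s what that last bullet can look like in Go, using the standard library’s net/http/httptest; the handler and endpoint are placeholders for whatever your service actually exposes.

```go
package orders

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// healthHandler stands in for whatever endpoint your service exposes.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	w.Write([]byte(`{"status":"ok"}`))
}

func TestHealthEndpoint(t *testing.T) {
	// Spin the handler up on a throwaway server, exactly like CI would.
	srv := httptest.NewServer(http.HandlerFunc(healthHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/healthz")
	if err != nil {
		t.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}
}
```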

APIs and interfaces

Services need to communicate with each other. Event messaging automatically comes to mind, with a message queue system (eg. RabbitMQ, NATS, etc). However, there are cases where synchronous responses will be needed, based on your business requirements.
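For the asynchronous side, here’s a rough sketch of publishing an event with the streadway/amqp RabbitMQ client; the connection URL, exchange, routing key and payload are all hypothetical.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/streadway/amqp"
)

// OrderCreated is a hypothetical event payload.
type OrderCreated struct {
	ID    string `json:"id"`
	Total int    `json:"total_cents"`
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatalf("channel: %v", err)
	}
	defer ch.Close()

	body, _ := json.Marshal(OrderCreated{ID: "abc-123", Total: 4200})

	// Fire-and-forget publish; consumers in other services pick the event
	// up whenever they are ready, which is the whole point of going async.
	// Assumes the "orders" exchange already exists (otherwise declare it first).
	err = ch.Publish("orders", "order.created", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        body,
	})
	if err != nil {
		log.Fatalf("publish: %v", err)
	}
}
```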

Exposing good internal APIs is the first step here. Avoid re-using public APIs for internal use (or the opposite, which seems scary IMO). Implement drivers for your newly-created services, consuming the service’s internal API but returning DTOs/data-models - take care of marshalling/unmarshalling as well.
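And for the synchronous side, a minimal sketch of such a driver for a hypothetical internal users API: the HTTP call and the unmarshalling stay inside the client, so callers only ever see a typed DTO instead of interface{}.

```go
package usersdriver

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// User is the DTO the rest of the codebase works with, instead of passing
// raw maps or interface{} values around.
type User struct {
	ID    string `json:"id"`
	Email string `json:"email"`
}

// Client wraps the users service's internal API (the URL layout is made up).
type Client struct {
	baseURL string
	http    *http.Client
}

func New(baseURL string) *Client {
	return &Client{baseURL: baseURL, http: &http.Client{Timeout: 5 * time.Second}}
}

// GetUser handles the HTTP call and the unmarshalling so callers never
// deal with the wire format directly.
func (c *Client) GetUser(id string) (*User, error) {
	resp, err := c.http.Get(fmt.Sprintf("%s/internal/users/%s", c.baseURL, id))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("users service returned %d", resp.StatusCode)
	}

	var u User
	if err := json.NewDecoder(resp.Body).Decode(&u); err != nil {
		return nil, err
	}
	return &u, nil
}
```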

That said, when designing for micro-services, you need to have a solid idea of contracts between your services, as the architecture depends on that.

Scheduling and orchestration

Alright, enough beating around the bush. We’ve been talking about Docker containers all this time.

And when we’re talking about Docker containers, we need to pick our path in terms of scheduler and orchestrator before shipping anything to production.

The choice generally comes down to these options:

  1. Docker Swarm: comes with Docker installations and is a complete, yet simple, container orchestrator. Docs and walkthroughs can be found in the official Docker documentation.
  2. Kubernetes: originally a Google project, heavily battle-tested and quite hyped nowadays. It is not exactly just a scheduler; it is a huge project combining a dozen different components, one of them being scheduling your containers. For production workloads where you need sophisticated integrations and operations automation, it is a very solid choice. Be warned though: it is heavily complicated and comes with a steep learning curve. (Docs)
  3. Nomad: plain and simple, just ultra-light scheduling (a ~35MB executable). This is another HashiCorp product, so you can expect 1st-class support and integrations with the rest of the HashiCorp product suite, eg. Consul for service discovery and more, or Vault for secrets management. Considering HashiCorp’s great learning resources, you can get up and running with Nomad within hours. Project page: Nomad

(Bonus) Serverless: serverless technologies have evolved tremendously during the last decade. That list would be incomplete without even a reference to this option. During the limited time I toyed around with serverless, I used the Serverless framework with AWS Lambda.

Service discovery

Service-mesh or not?

Your services will need to communicate with each other somehow. Find what’s suitable for you. In a docker-compose.yaml-based setup, the built-in service discovery mechanism could be sufficient for a simple architecture. But what if you have hybrid infrastructure, spread across a multi-region deployment or even across different cloud providers?

Service meshes have come into play lately. They are an abstraction layer for your networking needs that takes care of the communication between your services. And they have other goodies on top of that, eg. circuit breakers, healthchecks, latency-based routing and more. My favorite is Consul by HashiCorp. You set it up once and forget it even exists.
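As a taste of what that looks like from code, here’s a sketch using Consul’s Go API client to look up healthy instances of a hypothetical users service via the local agent.

```go
package main

import (
	"fmt"
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Talks to the local Consul agent (default http://127.0.0.1:8500).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatalf("consul client: %v", err)
	}

	// Ask only for instances whose health checks are passing.
	entries, _, err := client.Health().Service("users", "", true, nil)
	if err != nil {
		log.Fatalf("lookup: %v", err)
	}

	for _, e := range entries {
		fmt.Printf("users instance at %s:%d\n", e.Service.Address, e.Service.Port)
	}
}
```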

IaC and composable infrastructure (optional)

With the number of services in your organization always increasing, needing new infrastructure quickly becomes the norm. Instead of manually scaling up infrastructure resources via your cloud provider’s admin panel, why not follow the Infrastructure as Code principle?

Imagine describing your infrastructure as code, with tools like Terraform or AWS CloudFormation. Changes are stored in a git repository, which makes it a source of truth; state can be stored centrally so you can share it between multiple teams; and GitOps will probably make your life a bit easier while you try to manage this whole mess.

Conclusion

I really hope this is useful to someone other than me writing about my interests. Chances are only a couple of friends will read it, but hey, the Internet is about sharing knowledge, so you never know. I’d be happy to answer any questions you may have via Twitter/email - just give me a shout.

Next articles will include more in-depth analysis of those parts and even code examples to get you up and running.

Update: edited the Scheduling and Orchestrators section and added the option of Serverless. Kudos to my good friend John Dritsas for reminding me about it.

