
Looking Back on our TechTalkThursdays #11 and #12

Written by
Thomas Hug
Published
September 22, 2020

Besides our regular evening TechTalkThursdays, we tried two new time slots: one over lunch and one at 08:00 in the morning. Neither worked better than the evening slot, as both drew fewer participants.

This article summarizes the talks by Demian Thoma and Daniel Lorch.

How a Titan empowers our Cloud Monitoring Infrastructure

Nine hosts and manages thousands of servers for its customers. The company recently moved to a new monitoring solution based on the open-source tools around Prometheus. Nine’s Demian Thoma talks about how Nine implemented the new solution and how it gave them more insight into their infrastructure.

Nine was using Nagios before switching to Prometheus. Changing the monitoring stack allowed them to simplify the setup, gain more insight into their services, and remove a separate analytics infrastructure stack.

Site Reliability Engineering: What you need to know about Service Level Indicators, Service Level Objectives and Error Budgets

What does reliability mean to you? In his talk, Daniel Lorch reiterates the claim that reliability is the most important feature of any system. But a service only needs to be reliable enough to keep its users happy: investing too much in reliability results in higher cost (engineering time and infrastructure) without added benefit. Investing too little, on the other hand, results in unhappy users.

How do you determine and agree upon what “reliable enough” means for your services and your organization? Site Reliability Engineering provides tools and concepts to formalize this discussion, notably:

  • Service Level Indicators (SLIs): monitoring metrics that are indicative of a user’s goal
  • Service Level Objectives (SLOs): a target on an SLI that, if barely met, keeps the users happy
  • Error Budgets: the maximum amount of time the system can fail without contractual consequences. It is the remainder of the SLO, i.e. 100% minus the SLO target
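
To make the relationship between the three concepts concrete, here is a minimal Python sketch. It is not taken from the talk: the function names, the request-based availability SLI, the 99.9% target and the 30-day window are all assumptions chosen purely for illustration.

```python
from datetime import timedelta


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Hypothetical availability SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed failure time in the window: the remainder (1 - SLO) of it."""
    return window * (1.0 - slo_target)


def budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent; negative once the SLO is missed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    slo = 0.999  # illustrative target, not a recommendation
    window = timedelta(days=30)  # illustrative measurement window

    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(f"SLI: {sli:.4%}")
    print(f"Error budget over {window.days} days: {error_budget(slo, window)}")
    print(f"Budget remaining: {budget_remaining(sli, slo):.0%}")
```

With these example numbers, a 99.9% SLO over 30 days leaves roughly 43 minutes of error budget, and an SLI of 99.95% means about half of that budget has been spent.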

Watch the 30-minute talk below to learn about these concepts and see how an example SLI/SLO is defined for a fictitious game platform. Links to further information are provided at the end of the talk.

On this occasion, we would like to once again thank our speakers for presenting!