Pietro Menna Home page

Lessons learned during a Developer on Duty: Observability

A month ago, I had the chance to act as “Developer on Duty” for my team. Developer on Duty basically means you are on call and can be phoned if a system goes down or there is a production outage. When I got called, I was grateful that we had some great tools to analyze issues.

You are running a cloud system, which is live, and you have real customers using it. Unfortunately, you do not have access to run a debugger on these systems (and I guess we should not have). How do you analyze the problem, then? Observability (o11y) is the keyword here: inferring a system's internal state from knowledge of its external outputs. As outputs, we use not only the error messages regular users are getting but also the logs.

I became familiar with two tools: Kibana (for reading logs) and Dynatrace (for tracing). I am also aware the system has monitoring features to analyze CPU load, the database, and so on, but I have not yet become familiar with those tools.

Kibana supports several ways to search for information about a problem. The search I use most often is by correlation_id. With it, I can see all the log entries that a single request produced.

Dynatrace lets me follow the same correlation ID not only through the application's logs but also across components, such as databases and other systems.
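To make the idea of a correlation ID concrete, here is a minimal Python sketch using only the standard library. It is not how Kibana or Dynatrace work internally; it only shows how an application might stamp every log line of a request with the same ID so that a search tool can group them later. The logger name `o11y_demo`, the helper names, the log format, and the messages are all assumptions made for illustration.

```python
import io
import logging
import uuid


def build_logger(stream):
    """Create a logger whose lines always carry a correlation_id field."""
    logger = logging.getLogger("o11y_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s correlation_id=%(correlation_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def handle_request(base_logger, correlation_id=None):
    """Simulate one request; every log line it emits shares one ID."""
    cid = correlation_id or str(uuid.uuid4())
    # LoggerAdapter injects the extra dict into every record it logs.
    log = logging.LoggerAdapter(base_logger, {"correlation_id": cid})
    log.info("request received")
    log.info("querying database")
    log.info("response sent")
    return cid


def lines_for(log_text, correlation_id):
    """Rough stand-in for a log search: keep lines for one correlation ID."""
    needle = f"correlation_id={correlation_id} "
    return [line for line in log_text.splitlines() if needle in line]
```

Calling `handle_request` twice with different IDs and then `lines_for(text, one_id)` returns only the lines belonging to that one request, which is roughly what the correlation_id search gives me during an incident.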

Log level per instance or service

Specifically, on one of the nights I received a call, I noticed I could not get enough information about a given problem. I got lucky and found out that it was possible to change the log level of a running instance. The log level of an application is usually a configuration parameter or environment variable that the application reads, and it sets the minimum severity a message must have to be written to the log. By making the log level more verbose (for example, switching from INFO to DEBUG), it was possible to get more details about the problem.
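As a rough illustration of a log level driven by an environment parameter, here is a small Python sketch. The variable name `LOG_LEVEL`, the function names, and the fallback to INFO are assumptions for this example, not the configuration of the system I was on call for.

```python
import logging
import os

# Level names accepted in the (hypothetical) LOG_LEVEL variable.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(env, default=logging.INFO):
    """Read LOG_LEVEL from a mapping; fall back to default if unset or unknown."""
    name = str(env.get("LOG_LEVEL", "")).strip().upper()
    return _LEVELS.get(name, default)


def set_log_level(logger, env):
    """Apply the configured minimum level to a logger and return it."""
    logger.setLevel(resolve_log_level(env))
    return logger.level


if __name__ == "__main__":
    log = logging.getLogger("level_demo")
    logging.basicConfig()
    set_log_level(log, os.environ)
    log.debug("only visible when LOG_LEVEL=DEBUG")
    log.info("visible at INFO and below")
```

Passing the environment in as a plain mapping keeps the function easy to test. Changing the level of an already running instance additionally needs some trigger to re-read the value, which depends on the platform and is not shown here.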

Conclusion

If possible, get familiar with the logging, tracing, and monitoring tools available in your environment. They can be convenient in some situations, and learning to use them should not take too much time.

Why use message queues?

A few weeks ago, I interviewed a candidate who had worked with message queues, and I asked: why would you use one? What is the architectural advantage? The candidate did not know how to answer, so I explained what I thought they were for. I am writing this post in case that explanation is helpful to others.

Let’s say you just opened a pizza restaurant and hired two employees: a waiter and a cook. You have just two tables. Since you just opened the shop, the waiter simply walks to the kitchen and shouts the order. The cook starts working on whatever he hears, and as soon as he finishes, he calls the waiter to take the delicious pizza to the table. Everything is fine: you fulfill all orders on time, and everyone is happy.

This is what we call synchronous communication. The waiter does not wait for the cook to fulfill the order, but he does wait for an acknowledgment.

message-queue-1

Let’s say your restaurant grows thanks to the great pizza you deliver, and you are still using the same system (shout + acknowledgment). At some point, the cook in the kitchen might be so busy that he misses an order. As a result, your restaurant has unhappy customers, and you lose popularity. So you invent a system that works better: your waiter writes the entire order for a given table on a piece of paper and leaves it in the kitchen. As soon as the whole order is ready, the cook calls the waiter to deliver it.

This new system has two advantages: you stop losing orders because the people working in the kitchen were too busy to hear them, and you allow more than one cook to work on the orders. Waiters now leave the orders in an “orders queue,” and kitchen workers process them. Waiters no longer care whether someone is in the kitchen to pick up an order immediately.

message-queue-2

OK, now you have seen the word “queue.” Message queues allow services to exchange messages asynchronously. This means that the service that posted a message does not wait for something to happen before it continues, and the message is not lost if no processor is available to take it right away. The sender does not even know whether the message will be processed at all.

An HTTP POST is similar to the shout method. It is synchronous communication that can be lost if the service is too busy processing something else. With a message queue, you stop losing messages by decoupling the systems.

You may have noticed that I added more waiters and kitchen workers in the second picture. This is because you no longer care about “who” processes the order. It may now be a group of services or replicas of the same service, which lets you scale the number of processors. This is what people mean when they say message queues enable you to “scale.”
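The restaurant story can be sketched in a few lines of Python. This is only an in-process illustration using the standard library's `queue.Queue` and threads, not a real message broker; the function name `run_kitchen`, the cook names, and the pizza names are made up for the example.

```python
import queue
import threading


def run_kitchen(orders, num_cooks=2):
    """Waiters drop orders on a queue; any free cook picks up the next one."""
    order_queue = queue.Queue()
    served = []
    served_lock = threading.Lock()

    def cook(name):
        while True:
            order = order_queue.get()  # blocks until an order is available
            if order is None:  # sentinel: the kitchen is closing
                order_queue.task_done()
                break
            with served_lock:
                served.append((name, order))
            order_queue.task_done()

    cooks = [
        threading.Thread(target=cook, args=(f"cook-{i}",))
        for i in range(num_cooks)
    ]
    for worker in cooks:
        worker.start()

    # The waiter leaves each order and goes straight back to the tables;
    # he does not wait for a particular cook to acknowledge it.
    for order in orders:
        order_queue.put(order)

    # One sentinel per cook so that every worker stops.
    for _ in cooks:
        order_queue.put(None)

    order_queue.join()
    for worker in cooks:
        worker.join()
    return served
```

Calling `run_kitchen(["margherita", "pepperoni", "funghi"], num_cooks=3)` returns every order exactly once, each tagged with whichever cook happened to pick it up. Changing `num_cooks` changes how many workers share the load without touching the code that submits orders, which is the decoupling described above.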

There are other points that I could mention, but the idea was to share this simple explanation with a story. I hope you enjoyed it.

What I learned with COVID-19

Wherever you were browsing on the internet during the last 6 weeks, you have probably already read about COVID-19. In any case, you probably have not read anything positive about it. I am not trying to say that it is something positive, but during these almost 6 weeks of confinement I learned a lot of things that I otherwise would not have learned.

Remote working

I have worked remotely in the past, but never to this degree. Until now, I mostly stayed home on some days or worked on a customer site. The most important things I learned about working remotely are:

  • You need only a few items, but all of the basics are mandatory: a computer, an internet connection, and a phone. Then comes the space: you need a quiet room and headphones. Without any one of these, it is pretty much impossible to work from home for extended periods.
  • To maintain social relationships with peers, you must communicate proactively and with intent. This means you have to prepare before the call, but also start by talking about non-work-related topics (at most 5 minutes).
  • Differentiate between what requires synchronous communication and what can (and is best) left to asynchronous communication. There are many tools that development teams use in their daily work, but they are not always used properly. Examples: e-mails, instant messengers, product management tools, shared file stores, etc.
  • Closed door, open calendar: You are remote and nobody can see you (closed door), so keep your calendar updated. Colleagues and peers should be able to reach you by looking at your calendar and scheduling time for synchronous communication (MS Teams, Slack Calls, Skype, Zoom).
  • Use the same todo list for personal stuff and for work-related stuff. This means you will use the same calendar tool to schedule such events.
  • Set clear expectations on how you prefer people to contact you, and respect how other people would like to be contacted. Some people prefer that you contact them via Slack, others prefer scheduled meetings in Outlook, etc.
  • For synchronous meetings, send an agenda at least a day before.
  • Use video when possible.
  • And yes, give the benefit of the doubt when you read an e-mail that sounds harsh. This is key. Have empathy with remote colleagues, and always assume the best intentions. It is best in the long run. And remember, one day you might be the person who came across as harsh, and you would also prefer to get the benefit of the doubt.

Be grateful for what you have

I just realized how many things we take for granted: be it having a job (many people lost their jobs), be it having food (before the crisis, I used to eat out every day; now we have to cook at home), be it being able to stay with your loved ones (not everyone can do that at the moment; think of medical doctors or others who have to stay away from their families). Not everyone has access to this, and realizing that is somehow comforting.

I am grateful that I work with very professional and nice people as peers. The company I work for is very concerned about our wellbeing, and my direct boss is very understanding of all the needs I may have. Also, my peers and colleagues are super fun and have adapted to working remotely as best they can. It is good to be able to count on such a vibe.

I am really grateful to be employed, but I am also grateful that I am able to watch my daughter having breakfast every day. When I went to the office, I almost never had this chance. This is a unique opportunity for me. I am grateful to have had the chance to experience this.

Conclusion

Overall, I think the crisis is a real learning opportunity. We can choose to use it to learn things and come out stronger and smarter than we were before it, or we can just complain about it every day. I always try to choose the former.