The 7 rules of logging (Article 1/2)
From waste of time to sensitive data loss, let me take you through a journey around the importance of having a well designed logging framework and hopefully give you all the keys to build your own.
On the importance of logs
Let’s start with a few stories* where having logs could have been not only useful, but could actually have saved a company from more or less important things, from simple discomfort to sensitive data thief.
Study case #1: Binary Packed
Let’s start with a company called Binary Packed that specialised in some low level mathematics algorithms to compute output and input data for storage and shipment companies. A few years ago, they started their business by developing an algorithm that placed them on top of the market straight away because of the massive improvements in terms of costs that clients were seeing when using their product. They inevitably started increasing the development team size as their impact on the market was getting bigger and bigger.
At some point a few months after the small 5 people startup had become a 150 employees company, the head of development, Victor S, left the company, following a very interesting offer from a big company. The engineer in question was actually part of the 5 initial members and is the one that developed the core of the system, containing the principal algorithm that gave Binary Packed their major success at the very start.
Although, the success was so fast that the engineers from this company never considered implementing a logging system for their application as they were overwhelmed by feature requests from customers. But one day, someone made some changes to the core project and a few hours after being released, the support team drowned under calls and tickets from unhappy customers that were now about to send half empty trucks on the road since they updated their application. The first measure was obviously to revert to the previous patch, but then the nightmare began for the developers.
After weeks of debugging, they couldn’t understand why the issue was happening without having a proper logs system in place. They couldn’t even locate the issue. The release got delayed and therefore, the issue gradually escalated to higher levels, and at some point, the CTO of the company. For him, two solutions were there, either drop that feature completely or suck the shame in and call back Victor for help. Product wouldn’t let the feature be dropped, as very important contracts were about to be signed based on that. Therefore, they didn’t have any choice and they called their former employee to the rescue. After a few days getting back in context, he found the issue: a corner case that was not taken into account by the new engineer responsible for the core project.
That story tells us a little about how logs would have helped Binary Packed to locate the issue and fix it faster, they had the option of rolling back the previous patch to avoid customers to suffer too much from the error, and they were lucky the former employee that left the company with all the context was keen to help them. They get away with a bit of shame and a moral debt to Victor, but that could have been way worse.
Study case #2: Locked Out
For our second study case, we’re going talk about a security company: Locked Out, that specialised in account protection and password management. They offer their services to customers that want their accounts protected—e-mail inboxes, social media accounts, etc.
Like any fast growing software company, especially in security, Locked Out quickly attracted people, most of them are genuine customers but some individuals had plans to take advantage of their success to make their own name as “hackers”.
The first of many attacks happened a few years after the official launch of the social media protection software, the most popular product of the Locked Out suite. The hackers silently targeted the entire company’s system using some code injection and altering the databases little by little, a bit more every attack. After a month, the attackers had gained complete access to the system, without anyone noticing it.
They stayed calm for months, letting the company grow until one day, they decided to launch their final attack. They revealed thousands of customers emails that were supposed to be securely protected. Locked Out engineers started investigating, of course, but they didn’t know what to look for as for them, everything was going well.
The peculiarity here is that the team never implemented any logs for their database module. The exact point where they got attacked. As a security company, they thought they wouldn’t need to debug anything at that level, that they would catch any intrusion way before that. They were wrong. After this incident, they reviewed their code and found out that they were exposed to code injection in one specific form on their main software. They rapidly fixed the issue and performed a reset of their databases to get rid of the attackers’ back door.
Having logs in place wouldn’t have protected them from the few first attacks, but they could potentially have detected them earlier and stopped the attackers before they gained complete access to the system.
What the little stories we just went through together can teach us is that logs are more than a “nice-to-have”. You might not need them most of the time as realistically everything that goes in production doesn’t contain issues. But they can be a real lifesaver when it happens, so as a conclusion I would say: Invest time in your logs to avoid losing worse in the long run.
*The stories presented here are purely fictional and outlined for illustrational purposes only. Any similarity with a real company or real events is not intended.
The 7 rules of logging
In this part, we’re going to go through a list of rules and guidelines that I personally think are the main points which you should focus on if you’re looking to build a logging framework for your application or system. Some of the following will be easier to follow if the application is already up and running, and some are easier if you’re starting from scratch, but not being easy doesn’t mean less important, quite the opposite actually. The first 5 rules are, in my opinion, extremely important to follow where the last two are more like a nice-to-have.
I. One format to rule them all
It is very important to use a consistent format across your entire application. That way, anyone can understand anything logged anywhere in the application. You wouldn’t write a part of your logs in English and another part in Spanish, so why would you use different formats throughout your entire application?
That assertion implies that the format needs to be very generic. But to be understood by people reading them, the logs also need to be specific enough so they make sense. That’s why defining the format, and more precisely the fields contained in each log line is key here.
The format is a matter of personal preferences coupled with what’s supported by the logging system you will use to centralise the logs. It needs to be readable for humans, as they will likely be the ones reading them the most but it also needs to be readable for programs (aka parsable). For those reasons I personally tend to choose JSON, but you can pick any format you want as long as it respects the previous assertion: readable and parsable.
If you’ve been working on your application for a while, debugging here and there, you probably have a clear idea of what fields you need. It will require a lot more reflection if you're designing it from scratch, although brainstorming on the log format will open quite a few discussions about the application design itself, which is always good to have sooner than later, before the tech debt starts piling up.
The aim of the fields is to add more data (or meta-data) to a simple message and help it to make sense. For that matter you want to look at the following indicators:
You need to be able to locate the part of the code that a line is about pretty easily. For that, you will want to add fields like component, service, sub-component, … — depending on your architecture — That will help you understand where the subject of the logs is executed. You should also consider adding the filename and the line number at which the log got written.
One way to know if you need an extra field is to ask yourself the following question: “Can the same line get written twice for two different reasons?” If the answer is yes, then you probably need an extra field. One good example for that is if you’re using threads, you might want to add the thread ID as a field to be able to track 2 different executions of the same process in parallel. The same would apply if your application is going to run over a cluster of machines i.e. adding the IP or hostname in a field might be a good idea.
Always add a timestamp to each log line. It will make your life a thousand times easier when you have to read through all the logs. Especially if multiple components are writing to a single file in parallel, the lines can get out of order pretty quickly. With a timestamp field, your central log system will be able to put them in the right order for you.
II. Use severity levels accurately
Severity levels are key when it comes to understanding the flow of your system, whether you’re monitoring it or debugging an issue. By using them, you provide a much more granular view of the execution of your program to whoever is reading the logs.
The most common levels are DEBUG, INFO, WARN, ERROR and FATAL. They each determine, or at least help to determine, the actions required following an event being logged.
Here is an easy way of distinguishing the different levels:
Subject: Detailed execution of the operation not necessarily required to understand the flow.
What to do: Probably nothing.
Example: ”Appium server took 2.74 seconds to start”
Subject: Normal execution of the operation required to understand the flow of execution
What to do: Nothing.
Example: “Started Appium server for device 123456 on port 8085.”
Subject: Potential abnormal event that could lead to an error.
What to do: Debug before it becomes critical.
Example: “Timeout of 60s reached while starting Appium server. Retrying.”
Subject: Abnormal event that prevents the operation from completing.
What to do: Debug in priority.
Example: “Could not start Appium server after 3 retries.“
Subject: Unrecoverable error preventing the component/system execution.
What to do: Debug immediately.
Example: “No space left on disk.“
One interesting aspect of using severity levels is that they allow you to select a certain level of logs you want to push or not to your central log system. Define those “modes of execution” like Dev, Prod, Quarantine, … For example, in Prod mode you might want to retain the DEBUG level log lines on the machines but not push them in your central log system. That way you save space and also avoid unnecessary noise for anyone monitoring the system.
III. Use context in your logs
Each line of log should make sense on its own by containing the context of execution, either in the message itself or in separate fields. You should never have to read more than a handful of log lines to understand what’s going on and each of those lines should add to the whole description of your application’s execution.
Always prefer descriptive verbs to meaningless keywords. It makes things more natural to read and help whoever is monitoring to understand quickly what the application is/was doing at that time.
e.g. “Login failed” VS “User XYZ failed to login to service ABC” (user and service could also be separate fields here if it make sense for you)
IV. Optimise your logging
This rule might be the harder to apply properly, finding the right balance of logging takes time.
Too many logs would result in a noisy pile of text in front of your eyes, hard to make any sense of.
Too few would make the information delivered by them not accurate enough to do something with it or even inaccurate in some cases, leading you in the wrong path.
Also keep in mind that logging in an IO operation, which is expensive. You should therefore try to avoid logging the same information multiple times. For example, there is no need to rescue an error, log its message and re-raise the same error. A common good pattern here would be doing “error refinement”: rescue the error, log the error’s message and stack trace and raise a new error, more specific to your context. Although, you don’t want to be using this pattern all the time, most of the cases it would be better to stick to the original error and log it whenever you rescue it at higher levels so you preserve the original context of the error: stack trace, message, etc. Again, it is a matter of what suits your needs.
V. Do not log sensitive data
Even if your logs are likely not publicly accessible, they are often consolidated in a central system, sometimes operated by a third party. Even if they’re stored in-house, it’s pretty common to spend less effort on protecting the log system than a production database for example.
For that reason, you need to be aware to not log sensitive data such as passwords, public IPs or URLs and keep an eye open for them when you’re reviewing your team’s code. By doing so, you’re protecting yourself against an escalation of a potential attack: If someone somehow gain access to your central log system and start reading/parsing through your logs, they won’t be able to go any further using info from your own logs such as a public API and the keys to access it logged in the same message.
One pretty common way of implementing this rule is to integrate redaction hooks in your logger, that would parse your logs looking for potentially sensitive data and replace them with a nice and neat [REDACTED]. That way, if you miss some sensitive data during review, it will still get protected afterwards.
VI. Use logs as part of your metric system
You can, and should integrate your logs as part of your metric system. It doesn’t replace a proper alerting system but can easily complete it and it is usually made easy by CLS solutions to do so.
If you followed the previous 5 rules, you have most of the data about your system execution, you know for sure that the information in your logs is accurate, non redundant and divided across different severity levels.
You can then take advantage of that by identifying trends in your logs. For example you can count the number of logs under the severity “WARN” and use this to detect an increase in the number of warnings, probably meaning that your system is about to crash or, if you have all the fallbacks in place, taking up more resources than needed.
You can use all the fields you created to group the logs and define more specific metrics you want to monitor the trend, for example count the number of ERROR messages for a given component that is critical for your system. You shouldn’t, however, go as far as parsing the content of the logs to find for example durations and things like that. Those should be pushed separately in your metric system.
VII. Writing logs shouldn’t slow down or hang up system execution
As mentioned in the fourth rule of this article, logging in an expensive operation and takes up time and resources. However, it is not something that is part of your application on the same level as your core component. It isn’t less important, but it shouldn’t be taking resources from it.
By logging synchronously, i.e. writing directly to a file within your code, you are gonna be waiting for this write operation to be completed and the rest of your execution will be delayed, every single time you’re writing a log line. But there are many things that comes into play here when you write to a file: opening the file, creating it if it doesn’t exists, waiting on the file to be available in case multiple processes write to it, etc.
All those can be taken care of by creating your own library around your logging (or use a prebuilt one if it suits your needs) but all those checks will again add to the delay created by your logs. To avoid this, you can try and aim to log asynchronously. By modifying your library, you can make it deal with all the logs in parallel of your main application’s execution, by making use of a message queue for example.
By following the 7 rules defined in this article, I believe you can make the best out of your new logging framework and perhaps catch production issues, or even attacks, quickly and easily.
In this article, I’ve been referring a lot to a “Central Log System” (CLS). Although it’s pretty much self-explanatory, that part of the work has been left aside voluntarily to not interfere with the definition of the log framework itself.
In my next article, I’m gonna walk you through the decision we made here at BrowserStack to implement our central log system based on the stack named EFK (Elasticsearch, Fluentd, Kibana).