If you want to know if an application is running, implement health checks. I hope I never have to deal with the pattern suggested in this article in a production system.
Yup, he's trying to solve monitoring and logging with the same setup. I don't think I'd appreciate it, but, like all IT horror stories, it can probably be made to work.
This post falls into a common trap: conflating logging with metrics.
Log interesting things, where interesting is defined as context outside what the "happy path" execution performs.
Collect and expose system metrics, such as invocation counts, processing-time histograms, etc., to disseminate what the post uses log statements for.
Thanks for taking the time to reply! I'm relatively new to working on this type of system (large scale, event driven) and half posted because I know there are people on HN way better than me at this, and was curious about their opinions.
In the end, what's the difference between a log and a metric? Is one structured, and one unstructured? Is one a giant blob of text, and the other stored in a time series db? At the moment I guess I'm "logging my metrics" with structured logs going into Loki which can then unwrap and plot things.
You and the other commenters have given me the vocabulary to dig more into this area on the internet though. Thanks!
As a person who has worked in and around logging and big data processing for 16 years now, including almost a decade working as a senior in professional services (currently a global security architect) directly for one of the largest big data companies, here is my opinion on logs vs metrics:
A log entry should capture an event in time, for example: a person logging in, a failure, a record of a notable event occurring, etc. These should be written at the time they occur when possible, to minimise chance of loss and to minimise delay for any downstream systems that might consume the logs. Arguments for batching could easily be made for systems generating very high volumes of logs.
Conversely, a metric is a single value, point-in-time capture of the size of something, measured in units or with a dimension. For example: current queue depth, number of records processed per second, data transfer rate in MB/s, cpu consumption percentage, etc. These can/should be written periodically, as mentioned in TFA.
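To make the distinction concrete, here's a minimal Python sketch (emit_gauge and the names here are hypothetical stand-ins for whatever your stack provides):

    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("worker")

    work_queue = []  # stand-in for a real work queue

    def emit_gauge(name, value):
        # Hypothetical metric emitter; in practice statsd, Prometheus, etc.
        print(f"{name}={value}")

    # A log entry captures a discrete event, written at the moment it occurs:
    logger.info("user %s logged in", "alice")

    # A metric is a point-in-time measurement of a size, written periodically:
    for _ in range(3):
        emit_gauge("queue_depth", len(work_queue))
        time.sleep(1)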
> I'm relatively new to working on this type of system.
> In the end, what's the difference between a log and a metric?
Don't take this as a put-down, but writing a logging advisory blog post when you don't know the difference between a log and a metric seems like a peculiar thing to do.
But I'm not shaming lack of knowledge, we all had to learn somehow.
Very few people do that though.
> In the end, what's the difference between a log and a metric?
Essentially, a log entry is the emission of state known by an individual code execution path at the point the log entry can be produced, whereas a metric is a measurement of a specific runtime execution performed by the system.
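For example, a log entry along the lines of (a hypothetical statement; the names are stand-ins):

    logger.info("processed %d events in %.1f seconds", event_count, elapsed_seconds)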
Emits a log entry capturing the processing state known when the statement is evaluated. What it does not do is separate this information (a time-based attribute in this case) from other log entries, such as "malformed event detected" or "database connection failed."
More importantly, putting metrics into log entries forces timing to include log I/O, requires metrics analysis systems to parse all log entries, and limits the type of metrics which can be reported to be those expressible in a message text field.
Maybe most important of all, however, is that metrics collection and reporting is orthogonal to logging. So in the example above, if the log level were set to "error", then there would be no log-based metric emitted.
This is a reasonable first pass answer, but there's more nuance to this...
> What it does not do is separate this information
Logging at scale should really be structured, which means that you can trivially differentiate between different types of log message. You also get more dimensions all represented in that structure.
> limits the type of metrics which can be reported to be those expressible in a message text field
This is another example, logging shouldn't be text based ideally. You might have a summary human readable field, but metrics can easily be attributes on the log message.
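As a sketch of what that looks like with Python's stdlib logging (the field names are made up; real systems would use structlog, OTel, etc.):

    import json
    import logging

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            # Human-readable summary plus machine-readable attributes.
            return json.dumps({
                "msg": record.getMessage(),
                "event": getattr(record, "event", None),
                "duration_ms": getattr(record, "duration_ms", None),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # The metric rides along as an attribute, not as message text:
    logger.info("batch processed", extra={"event": "batch_processed", "duration_ms": 42})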
The more I work in this area the more I'm realising that logs and metrics are pretty interchangeable. There are trade-offs for each absolutely, but you can convert logs into metrics easily (Datadog does this), and with a bit more effort you could turn a metric into logs if you wanted to (querying metrics as rows in a SQL database is handy!).
Metrics collection is also not necessarily orthogonal to logging, it depends on your system. From a server, you might have logs pushed to an external source and metrics pulled from the server by Prometheus, but that's just implementation details. You can also have logs pulled from log files, and metrics pushed to a statsd endpoint.
I've worked on mobile apps where metrics get aggregated locally and then pushed as log events to the server with one log event per metric and dimension set, only for the server to then typically turn them back into metrics.
It's good to understand the tradeoffs, the technology, whether you're using push or pull, where data is spooled or aggregated, data costs, etc. But this stuff is all pretty malleable and there's often no clearly right answer.
I think what you're saying is that you can make a logging system LARP as a metrics system. In the end it's logging on fd 1 and 2 while metrics are usually served over HTTP; of course you can dump "metrics" into stdout, but it's not as practical given what the tools are built for.
In my local fun projects that run on my machine I might dump metrics into the logs because it's practical, but it doesn't make it "right".
I know and understand the reasons for that rule, but it’s one of the first ones I disable in linters. The theoretical benefits in the context of the systems I work on aren’t worth the extra friction.
The question is whether you want to do your aggregation by unit time at the application level, or at an observability layer. You're absolutely right that the end user of metrics wants to see things grouped by time - but what if they want to filter down to "events where attribute X had value Y, in 10 second increments" but you had decided to group your metrics by 15 second increments without regard to attribute X?
Various companies, both in-house for big tech and then making this more widely accessible, started to answer this question by saying "pump all your individual logs in structured form into a giant columnar database that can handle nearly arbitrary numbers of columns, and we'll handle letting you slice and dice metrics out of any combination of columns you want. And if you have an ID follow the session around between different microservices, and maybe even all the way to the browser session, you can track the entire distributed system."
Different people might say that Datadog, Honeycomb, or Clickhouse (and the various startups backed by Clickhouse as a database) were the ones to make this pattern mainstream, and all of them pushed the boundaries in one way or another - nowadays, there's a whole https://opentelemetry.io/ standard, and if you emit according to that, you can plug in various sinks made by various startups, and choose the metrics UX that makes the most sense for your use case.
I'm a huge fan of Honeycomb - when I know a certain issue is happening, I can immediately see a chart showing latencies and frequencies, and click any hot spot to filter out the individual traces that exhibit the behavior and trace the end-to-end user journey, with all the different logs from all the systems touched by that request. And I can even begin this discovery from a single bug report by a single user whose ID I know. It's not just metrics - it's operational support. And if I'd pre-aggregated logs, I'd have none of this.
But of course, there are systems where this doesn't make sense! Large batch jobs, high-performance systems with orders of magnitudes more events than a standard web application... it's not one size fits all. That said, I think knowing about modern observability should be part of every developer's toolkit.
A quick summary that does it for me: a log is something you read, a metric is something you measure.
Use cases:
Log: search, get context, read
Metric: measure, plot dashboards, define alerts
My theory for the concepts being so mixed up together: you use both to troubleshoot, and I think the old school way to emit metrics was to parse logs and turn that into measures.
I love how open and non-defensive this comment is :)
There are a few ways to slice this, but one is that logs are human-readable print statements and are often per-task. E.g. if you have 100 machines, you don't want to co-mingle their logs because that will make it harder to debug a failure. Metrics are statistics and are often aggregated across tasks. But there are also per-task metrics like cpu usage, io usage etc.
They can both be structured to some extent. Often storage strategies might differ but not necessarily. I think at Google the evolution of structured logging was probably something like (1) printf some stuff, (2) build tooling to scrape and combine the logs, (3) we're good at searching, but searching would be easier if we just logged some protos.
I think logs are basically self-explanatory since everything logs. To understand why you would want separate metrics, consider computing the average cpu utilization for your app across a fleet of machines. You don't want to do that by printf-ing the CPU usage, grep-ing all the logs, etc. You could try to do that with structured logs, and I'm sure some structured-logs SaaS companies would advocate that.
If you're new to this space, I really liked the book Designing Data-Intensive Applications.
There's also tracing
You might want to check out this very nice article on reservoir sampling, which discusses its application to logging: https://samwho.dev/reservoir-sampling/
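The core of it is tiny; a sketch of the classic "Algorithm R" (a uniform random sample of k items from a stream of unknown length):

    import random

    def reservoir_sample(stream, k):
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Keep the new item with probability k/(i+1).
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(1_000_000), 10))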
I'm not sure I want to weigh in on "log" vs "metric"... but I did want to add some thoughts on logs in general.
If you need to "log" something to give users feedback as the system is running, it may be less of a log and more of a progress or status output.
Logs to me are things which happen and that I want to be able to trace later, so summarizing or otherwise dropping logs that come in quick succession would be a problem. If I need to filter I pipe to grep; otherwise I can just save it all and read through it later.
Status messaging, which may be informative about your process, is useful, and if its goal is to be observed in real time, then yeah, a message or two a second seems like a good goal for consistency.
These are just two very different use cases to me. And generally I find the former critical to get right, while the latter may be nice to have and may lead to discovery by nature of making it more accessible.
In essence:
Logs mark some event in the system.
Metrics model some measurable, quantifiable state.
In high volume systems both can then be observed through various sampling techniques. A key point is that sampling is best handled separately from the application logic creating those signals, as it may change over time or be dynamic.
> The moment of capturing a measurement is known as a metric event
Which suspiciously reads like a log.
In practice, a metric is an aggregate of events (the "metric events") for when you're not interested in the individual event but in the aggregate itself. For practical reasons this is implemented not with logs but with more primitive event-emission machinery.
These are not fundamentally incompatible notions. If you do an electrocardiogram, you might be interested in your BPM, but it is deduced from the full log of each beat. The segregation we do in computing is more practical than fundamental.
Read the SRE book, maybe some of the highlight posts only. It will give you the right jargon and a lot of wisdom that you can then simplify for your use cases.
Logs should be bursty, because they're most useful when debugging rare issues. If you have identical log lines, then that should have been a metric instead.
Metrics should be sampled based on frequency, because they deduplicate. I'm a huge fan of logarithmically sampling metrics.
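One way to read "logarithmically sampling" (my interpretation, not necessarily the commenter's exact scheme): emit on the 1st, 2nd, 4th, 8th... occurrence, so even a runaway event costs only O(log n) lines:

    import logging
    from collections import defaultdict

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("app")
    seen = defaultdict(int)

    def log_sampled(key, message):
        seen[key] += 1
        n = seen[key]
        if n & (n - 1) == 0:  # powers of two: 1, 2, 4, 8, ...
            logger.warning("%s (occurrence %d)", message, n)

    for _ in range(100):
        log_sampled("db_timeout", "database connection timed out")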
I enjoyed our metrics systems at Amazon and wrote one with a similar API at Okta, and should really look at writing another one to open source.
One of the huge missing things in metrics systems, imho, is keeping granular metrics in the context of a business operation and then using late aggregation for trends. Last I looked, nearly every metrics system either logged individual events and required processing for any rollup, or aggregated too early so you couldn't determine the effect on any individual operation/request. There's a happy medium where you can get per-request counts, stats, and timing and still roll those up at host/data center/region granularity to get higher level trends.
Most metrics APIs are incompatible with this idea, however.
You're likely talking about "wide events", which are essentially as many dimensions as possible attached to an event.
I believe Meta was the one to develop an internal tool for handling this, named Scuba.
I thought the post was nice. I've written scripts before where I'd just print by count and be overwhelmed. I should've just used time instead of changing the count number.
One exception to this is batch scripts and other cli tools with a clear start and end, like an installer, rsync, curl, dd, etc. Setting up metrics here is way overkill and the user may still be interested in the progress. Easiest made available through logs. Curses UI could be a nice middle ground but also very often overkill.
Metrics are way quicker to query due to aggregations and tend to be more stable as features change.
It's good to save metrics for things that remain true under arbitrary aggregations (e.g. sum, count, maximum) and avoid things that do not survive aggregation, such as percentiles.
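A quick worked example of why: maxima and sums compose across shards, medians/percentiles don't:

    import statistics

    a = [1, 2, 3, 100]   # shard A's samples
    b = [4, 5, 6, 7]     # shard B's samples

    # Sum, count, and max survive aggregation:
    assert max(max(a), max(b)) == max(a + b)
    assert sum(a) + sum(b) == sum(a + b)

    # Percentiles don't: the median of per-shard medians...
    print(statistics.median([statistics.median(a), statistics.median(b)]))  # 4.0
    # ...is not the global median:
    print(statistics.median(a + b))  # 4.5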
> This post falls into a common trap; conflating logging with metrics.
This isn't as much "conflating" as it is constructing an ad hoc metrics subsystem that exports the metrics to the logs.
There's no theoretical difference between exposing a prometheus endpoint that's scraped every x seconds and printing the same data to the logs every x seconds.
Filter and aggregate after you log your metrics, traces, log messages, etc.; not before.
You can worry about data retention, rollups, and other strategies for limiting data storage separately from the systems that emit the data.
At least with the right data stores. I kind of like what OpenSearch and Elasticsearch do for this. In Elasticsearch you have a data stream. You configure it to roll over based on time or data size. Once rolled over, indices are read only; new data appends to the current one. You can then define lifecycle policies to decide what to do with the old ones, e.g. move them to cold storage, transform them with rollups, and eventually delete them.
With application logging, you typically assign different log levels. Trace and debug are typically disabled in production (or should be). Info can be quite noisy. Warn tends to be repetitive (because developers tend to ignore warnings and will never fix them). Errors should be rare.
I have my system configured to start emailing me if errors get logged. An error means something is broken and needs to be fixed. Zero tolerance on errors. When an error happens, all the other log information provides me context. So there's value in retaining that. But only for a few days at best. Long enough to survive a weekend or things like Christmas. But after that it's just noise. I have a hard cut at about two weeks. Some places you need to store stuff longer for ass coverage reasons.
Data retention comes at a price of course. I've seen companies log ginormous amounts of data while ignoring all their errors. 30GB per day. Absolutely appalling. Me: it looks like your database layer is erroring non-stop (constraint violations and worse); you might want to do something about that. Them: ah no, that's just normal, we just ignore it (php shop, incompetence was the norm). Me: so how do you know when something breaks?! Them: ......?!
My well paid consulting gig was beating some sense into this operation after one of the managers noticed they were spending hundreds of thousands per year on this nonsense. My fee was a rounding error on that. Easiest job ever. But kind of cringe-worthy once I started looking into what they were actually doing and why. Mostly it's just, "yeah, some guy set that up once and then we never looked at it and he left. What are you going to do?!". There was a lot of that with this company. Just absolutely nobody who even cared about the waste of resources or getting any meaningful feedback from their logging. If that's your team, you need to do something about it. That's your job and you're not doing it well. If you need an external consultant to tell you, you might want to reflect on the notion of majorly shaking things up a bit.
Metrics should be emitted in a separate stream and never via logs, outside corner cases. Logs should be used to determine WHY the system is having issues, but never WHETHER the system is having issues.
Log alerting is a fool's errand that looks like a great idea at the start but quickly becomes a sand trap that will drive future people crazy and, at scale, will overwhelm systems.
Why is log alerting a bad idea?
Every log becomes a metric point that must be dealt with. Therefore, the logging system must be kept operational and error free. However, due to the other problems below, this system quickly becomes a beast of its own.
Logs are generally much bigger than a KV pair of <Metric> <Value>, so there ends up being a ton of filtering going on in the logging system, adding to the load.
The logging system probably does not understand rates, so you end up writing gnarly queries to answer "Is this the first unhandled exception in 10m, or my 50th?" The equivalent query in Prometheus is much, much simpler.
Each language's logging library handles things in a different way, so the organization must be on point to either A) keep the log format the same between all the different languages, or B) teach the logging system how to manipulate each log into a format that can be handled by the alerting system. Obviously A causes massive developer friction and B causes massive Ops friction.
Finally, I find people doing logging tend not to handle exceptions as well, because they can just trust the logging system to alert them on a specific problem and deal with it manually.
So for future Ops person who has to deal with your code, I'm begging you, import prometheus_client.
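For what it's worth, the basics with prometheus_client are only a few lines (the metric names here are just examples):

    from prometheus_client import Counter, Histogram, start_http_server

    EVENTS = Counter("events_processed_total", "Events processed")
    ERRORS = Counter("unhandled_exceptions_total", "Unhandled exceptions")
    LATENCY = Histogram("event_processing_seconds", "Per-event processing time")

    start_http_server(8000)  # exposes /metrics for Prometheus to scrape

    @LATENCY.time()
    def handle(event):
        EVENTS.inc()
        # ... actual work ...

And the "first or 50th unhandled exception in 10m" question from above becomes a one-liner in PromQL, something like increase(unhandled_exceptions_total[10m]) > 49.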
I've noticed that for some reason developers really like using logs in place of actual metrics. We use Datadog, and multiple times now I have seen devs add additional logging to an application just so they can then create a monitor that counts those log events. I think it's a path of least resistance thing; emitting logs is very easy, and counting them is also very easy. Reporting actual metrics isn't really difficult either, but unless you're already familiar with the system it's more effort to determine how to do it than just emitting a log line, so yeah.
Because when the application is breaking it's good to know why! Logs can be just as ephemeral as metrics -- in many cases, even more so. They're not even mutually exclusive.
Where exactly does this anti-logs sentiment come from? Is it because tools like datadog can be lackluster for reading logs across bunches of hosts?
For me, I don't use Datadog, so it's not that $ParticularTool does not work with logs; it's all the stuff I put in my original post: it's a ton of samples, filtering puts heavy strain on the systems, and it's extremely brittle IME.
If you have good metrics, you can generally get much further without even aggregating logs, beyond tossing everything into STDOUT and checking on it when you have alerts.
It's claiming that you should output working logs, metrics, and failure logs into different streams.
My experience is that metrics may tell you something is wrong, but logs are required to tell you what went wrong and why.
A simple fixed-length rolling buffer can get you pretty far for logging, but it isn't something you necessarily want to get off-device except when something bad has happened.
I imagine many people learn on an environment like this and get thrown in a high volume one without chance to adapt.
log all major logical branches within code (if/for)
if "request" span multiple machine in cloud infrastructure, include request ID in all so logs can be grouped
if possible make log level dynamically controlled, so grug can turn on/off when need debug issue (many!)
if possible make log level per user, so can debug specific user issue
The only one I'll add is: If your logs are usually read in a log aggregator like Splunk or Grafana instead of in a console or text file, log as JSON objects instead of lines of text. It makes searches easier.
Or logfmt, which is easier to read for humans, lower overhead, and is still structured and supported in at least Grafana/Loki for parsing and queries.
For Python users, there's a "logfmter" package which is enormously more straightforward than the popular "structlog" one.
Following this advice, I've seen a service where you could say the main function was to produce logs, and the actual response to the user was only a small part of the traffic generated.
What we really need is smart logging: only log the full span when an error is detected, otherwise no need for it. But it's not a very well supported case.
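Python's stdlib actually gets close to this with logging.handlers.MemoryHandler, which buffers records and only flushes them to the target when a record at flushLevel or above arrives (caveat: it also flushes when the buffer fills, so it's an approximation of the idea rather than a true ring buffer):

    import logging
    from logging.handlers import MemoryHandler

    target = logging.StreamHandler()
    buffered = MemoryHandler(capacity=1000, flushLevel=logging.ERROR, target=target)

    logger = logging.getLogger("app")
    logger.addHandler(buffered)
    logger.setLevel(logging.DEBUG)

    logger.debug("step 1")  # buffered, nothing written yet
    logger.debug("step 2")  # buffered
    logger.error("boom")    # flushes the buffer: full context comes out with the error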
I worked at a place that banned logging levels. Everything sent to the "log" function gets written to the logs, period. And then the rest was left to the services' maintainers to figure out.
That, combined with a real, genuine devops scheme where the people implementing the system and the people keeping it running in production were on the same team and generally the same actual people, seemed to produce some of the most excellent and usable logging I've ever seen. Without needing a whole bunch of rules to try and force everyone to (still fail to) get there.
One neat thing that I think really facilitated this was the sense of empowerment that came with having exactly one rule (logging isn't configurable) combined with one goal (keep the system up). We did decide we wanted smart logging along those lines. And we did see that existing solutions didn't support this very well. So someone wrote one. And it was so dead simple, and easy to use. The 'user manual' for new hires was basically, "Here's 50 lines of code that you should read."
I wish we could all be so lucky as to only care "when an error is detected." Logging is about creating breadcrumbs that can be searched and cross-referenced to piece together what happened when no error is detected, but the behavior is nevertheless suspect.
That's clever and I'll definitely use it in the future.
It's not a question of number of machines, it's a question of how much load any individual program can serve. Requests per second, per process. If that's O(1k) or less, then sure, do whatever you want, it's trivial load.
"customer" is hard to describe since the buyers are very large organizations. Think multinationals, entities which have air forces and space programs, etc.
My colleagues love to log as little as possible and most of the projects I’ve seen still treat logs as files instead of event streams that could have some search and filtering and categorization and automated alerting.
It’s kind of unfortunate, because for example there’d be pushback against logging branches in code etc., except for trace logs (that others wouldn’t add) that are also off most of the time when problems actually happen. It does help a lot in personal projects though, albeit the limited traffic there kinda minimizes any problems that ample logging might otherwise cause.
At least it’s possible to move in the direction of adding some APM like GlitchTip or Skywalking.
I've had colleagues try this. It rarely works. Logging every if ends up introducing a huge amount of overhead, both in terms of processing power, but especially in terms of storage. You almost always end up having to filter based on some sort of log level that you then turn off by default in production.
The problem with that is that you're now required to reproduce the issue after turning on the logging, and if you already have a reproducer, why not just attach a real debugger?
The overlap of "we can reproduce" but "it has to run on the production server" ends up being practically zero.
Probably due to no direct access to the environments, whereas there being less organizational resistance to toggling logging levels... assuming that the issue is even easily reproducible instead of "Okay, this happened once and cost us X$, it was pretty horrible, we don't know exactly why that happened but fix it before it happens again."
(not an exact case, but a pattern I've seen)
Something I've done in the past is send some logs to BigQuery for cheap mass storage and others to Grafana for fast querying and use in live dashboards. Basically a filter rule in our logging agent to send different events to different destinations. I think with some more hacking I could get both datasources into the same Grafana frontend...
Often logging backends have a sampling rate to take care of this for you. It is almost certainly better to set up a buffer in your logging layer and deal with this there than to try and do this everywhere in your code.
This feels like one of those "not obvious until you've seen it in production" requirements: any production-ready logging framework should have a mechanism to delay parameter evaluation until after the threshold/destination checks are performed. Most languages have some version of deferred execution (lazy evaluation, thunks, etc.)
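Python's stdlib logging illustrates this: %-formatting of the message is deferred until a handler actually emits the record. One subtlety: logger.debug("%s", f()) still calls f() eagerly, so truly expensive computations need a lazy wrapper or an explicit level guard:

    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("app")

    def expensive_summary():
        return "pretend this walks a huge data structure"

    class Lazy:
        # Defers the computation until the record is actually formatted.
        def __init__(self, fn):
            self.fn = fn
        def __str__(self):
            return str(self.fn())

    # __str__ (and thus expensive_summary) never runs: DEBUG is disabled.
    logger.debug("state: %s", Lazy(expensive_summary))

    # Equivalent explicit guard:
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("state: %s", expensive_summary())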
Putting a few important metrics in the logs every 10s is something that the Aerospike datastore also does (https://aerospike.com/docs/database/observe/latency). This is useful because when you contact support, they run a script to generate a table of historical latencies from the log without depending on you having set up Prometheus, CloudWatch, etc.
In this particular example, I agree with others: this is a case for metrics. "Log errors, metric successes[0]."
0: success events (a bit more than a log typically) may be important, especially if tied to something you charge for.
I agree with this. Logging, as well as metrics and tracing, are such hard topics for me to wrap my head around though.
From the log consumer (person) perspective, you'd want logs to provide you with sufficient information when troubleshooting. But since trouble usually happens when things go wrong in unexpected ways, the logging likely won't be well aligned to emit the right info for you to figure out what's going wrong exactly. What then, are you supposed to log the entire application state and every change to it? But then that's way too expensive, and there's a decent chance you might just drown in the noise instead. So you're left with this half artform half science type deal.
One thing I'm grateful for is that over the years most everything now logs in JSON lines at least. I just wish there was a standardized, simple way to access all the possible kinds of JSON objects that might be emitted into the logs. A schema would be a good start, but then I can immediately see ways how that would be quickly rendered lot less useful early on (e.g. "this and that field can contain some other serialized JSON object, good luck!").
Everything is events. The problem is that, as you notice, you frequently encounter situations where there are too many events to handle. Metrics, logging, and tracing are just three different ways to handle that problem.
Metrics handles too many events by aggregating them. You handle too many events by squashing them into a smaller number of events that aggregate the information.
Logging handles too many events by sampling them. If you have N times as many events as you can handle, take 1 in N of them or whatever other sampling model you want.
Tracing is logging, but where you have chains of correlated events. If you have a request started and a request ended event, it is pretty useless to get one without the other. So, you sample at the "chain of correlated events" level. You want 1 in N "chains of correlated events".
But, if you have enough throughput for all your events, just get yourself a big pile of events and throw it into a visualizer. Or better yet, just enable time travel debugging tracing so you do not even need to figure out how the events map to your program state.
Somewhat formalized: https://peter.bourgon.org/blog/2018/08/22/observability-sign...
> What then, are you supposed to log the entire application state and every change to it?
For replayability/state reconstruction, usually it's enough to log the input data and the decisions made upon them i.e. which branches of the if/switch (and things morally equivalent to them e.g. virtual functions and short-circuiting Boolean operators) you've actually taken.
> But then that's way too expensive,
Yes, it's usually still way too expensive. But when it's not, it does give you information about at what code point exactly the "wrong" decision was made, and from there you can at least start thinking about how the system could get into the state where it would start making "wrong" decisions at this precise point of code — and that usually cuts down the number of possible reasons tremendously.
> I just wish there was a standardized, simple way to access all the possible kinds of JSON objects that might be emitted into the logs. A schema would be a good start ...
While not an industry standard, an open source specification for JSON log entries commonly used is ECS[0]. There are others, but this one can serve a system well IMHO.
My personal answer to this is logging very little during normal operation and then logging a lot during errors. Depending on the maturity of the system, "a lot" might mean the entire state so I can debug afterwards.
Even better: log absolute total counts of received and finished events. You can easily extract the rates from that, and you'll know if the process builds up a lot of simultaneous processing, and you can more easily compute longer-term averages, and you know if it is starved for work or resources, etc.
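A worked example with made-up numbers; rates and backlog fall out of the totals:

    # Two periodic log lines of cumulative counters:
    #   t=100s  received_total=5000  finished_total=4990
    #   t=110s  received_total=5600  finished_total=5585

    rate_in  = (5600 - 5000) / 10   # 60.0 events/s arriving
    rate_out = (5585 - 4990) / 10   # 59.5 events/s finishing
    backlog  = 5600 - 5585          # 15 events currently in flight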
I think, if you work with logging long enough, this becomes common sense at some point. Consider when the logging backend or subscription gets hit by bursts, e.g. an unlikely code path gets repeatedly hit that logs an error or, in the worse case, an exception trace. With logging every X occurrences you reduce the total amount, but don't have control over the overall rate. With logging every X seconds, you can manage the rate better, which is useful depending on the logging backend's performance or subscription.
There is an additional benefit to throttling by time, it is a lot easier to do it efficiently in multithreaded environments.
If you log by count, you need a global counter for that event (you could do thread-local, but then your logging volume would depend on the number of threads). If the code path is hot (which may be the case if you want to throttle your logs) multiple threads will contend on the increment, and that can be very expensive.
If you log by time, you just need a load and a clock read (on Linux, `CLOCK_MONOTONIC_COARSE` is a handful of ns and the resolution is enough for this purpose), and only need synchronization (a compare-and-swap) when the timer expires, so threads virtually never interfere with each other.
For high-concurrency scenarios, sharded counters (per-thread counters with occasional global sync) or probabilistic logging (log with probability 1/N) can also solve the contention issue while roughly maintaining count-based semantics: exact for sharded counters, in expectation for probabilistic logging.
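The probabilistic variant needs no shared state at all; a sketch:

    import logging
    import random

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("app")

    N = 1000  # expect roughly one line per N calls, with no contention

    def maybe_log(message):
        if random.random() < 1.0 / N:
            logger.info("%s (sampled 1 in %d)", message, N)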
That still means each thread will do its own separate log call every second (or whatever the period is) instead of all threads aggregating into a single log call.
No, the timer is still global (that's why you need the compare-and-swap). But the threads only need to do reads most of the time, and reads do not cause contention. Writes do.
It looks something like this (pseudocode):
    static std::atomic<uint64_t> deadline{0};  // shared by all threads

    // Cheap coarse monotonic clock read (e.g. CLOCK_MONOTONIC_COARSE),
    // expressed as an integer tick count; "period" is in the same units.
    uint64_t now = coarse_clock_now();

    uint64_t curDeadline = deadline.load(std::memory_order_relaxed);
    if (now >= curDeadline &&
        deadline.compare_exchange_strong(curDeadline, now + period,
                                         std::memory_order_relaxed)) {
      // Only the one thread that wins the CAS gets here, at most once
      // per period. Actually log.
    }
Your “actually log” is within one thread. Either the threads all do their separate “actually log”, or they have to synchronize their data into a single shared “actually log”.
> Either the threads all do their separate “actually log”
But why? Often the purpose is just to log a "been here" signal with some additional details for diagnostics. You don't need to include an accumulation of everything that happened since the last log. All that you care about is that the log happens at most once per period, say once per second.
If you do want to also log some data that accumulates everything that happened, you can accumulate the data in thread-local buffers, and in the "actually log" part collect all the buffers and log them. Since this only happens in the thread that "wins" the CAS, it is still very scalable. This is a very common technique.
If you throttle by count, you cannot avoid the contended atomic increment (you can with some sophistication and at the cost of some approximation).
Which requires some sort of synchronization (or lock-free data structures), because of concurrent writes by other threads. In that situation, you can also simply use a dedicated thread to periodically flush the log buffers.
Yes, but crucially, only 1/period, not on every single "should log?" call, which is what I was referring to above. The per-thread mutexes are uncontended virtually all the time.
I think there's a bug in the sample code, because the claim is to have a "consistent log rate": In the example, "read_event_from_queue()" should be blocking. If there are no items read from the queue within the given time interval, the logging part will never trigger, so the time-based logging does not have a consistent log rate.
Besides the (potential) bug, it is a cool idea, if emitting metrics is not an option.
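A sketch of the fix in Python, assuming a queue.Queue-like interface (the names are stand-ins, not the article's actual code):

    import queue
    import time

    LOG_INTERVAL = 10.0
    events = queue.Queue()
    count = 0
    last_log = time.monotonic()

    while True:
        # Block, but never past the next scheduled log line, so the
        # periodic log fires even when the queue is idle.
        timeout = max(0.0, last_log + LOG_INTERVAL - time.monotonic())
        try:
            events.get(timeout=timeout)
            count += 1
        except queue.Empty:
            pass
        if time.monotonic() - last_log >= LOG_INTERVAL:
            print(f"processed {count} events in the last {LOG_INTERVAL:.0f}s")
            count = 0
            last_log = time.monotonic()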
I think count based logging works really well for batch processing jobs where each item is a fairly constant amount of work. The log shows the time you started and prints another line (with the time) for each 1% of the batch. You can see the progress and guess when it will be done.
The human who is debugging an issue can see when we started, see that some processed successfully, see regular progress through the batch, then see that the 58th percentile batch hasn’t completed and that’ll be where the problem is.
The main benefit over the time based logging is that the code is much simpler, and the log output is simpler too.
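e.g. a sketch of the count-based version:

    import time

    def process_batch(items):
        total = len(items)
        step = max(1, total // 100)  # one log line per ~1% of the batch
        print(f"{time.strftime('%H:%M:%S')} starting batch of {total}")
        for i, item in enumerate(items, 1):
            # ... per-item work goes here ...
            if i % step == 0:
                print(f"{time.strftime('%H:%M:%S')} {100 * i // total}% ({i}/{total})")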
There are even libraries like tqdm that do this for you in one line of code.
I think I'm missing something. I am not a SWE, but when I'm looking at logs I'm not really all that interested in the number of events; I want to know what those events are. If I need the number, that's something my SIEM or BI should be counting.
The practical problem with logging by time is that it's not resource constrained: holding N seconds of logs, even when each line is a bounded size, takes potentially unlimited memory. Logging by count uses a bounded amount of memory, and is easy to implement with a fixed-size array in memory.
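In Python terms that's just collections.deque with maxlen; a sketch:

    from collections import deque

    recent = deque(maxlen=10_000)  # bounded memory; oldest lines fall off

    def log(line):
        recent.append(line)

    def dump_recent():
        # Only ship the buffer off-device when something bad has happened.
        for line in recent:
            print(line)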
You're talking about a different scenario than the article. The article is about a strategy for how to generate a single log line. You're talking about a strategy for how to batch multiple log lines together.
One way to reframe this is: "as a user [of the logs], what might I want to know?"
In my experience, this post is often right (and the logs are often wrong). There's a tendency to either log too much or log too little - if only a few items are getting processed, it's fine and maybe even good to log all 7 of them.
But if many, many are getting processed, you'll experience semantic overload as a reader of the logs. What you want is a compressed form.
Logging per time interval can be a very handy approach. In my work, we've settled on a hybrid approach - calculate in real time how often things are happening and then log the number of things that have happened, but at a rate that is roughly one log every N seconds.
This takes some more engineering up front but is remarkably often what a log reader actually wants.
Btw: logging and the ability to observe systems was the single most successful act of my career as a Data Product Manager.
I have had to fight, do tricks, and 'play the game' so much.
I cannot stress enough the importance of understanding atomic movements.
The cost is high, but not as high as the cost of not knowing.
Logging is an exercise in solving problems of future you.
We don't log just to have records of everything, we log to solve future questions.
I'm working on a system that generates huge numbers of log entries and have settled on a short-term solution of over-logging everything.
Once a log entry is persisted, I'm using it as a 'Bronze Layer' in a typical Medallion Model and will then filter that log data up into Silver and Gold layers, so I can have billing, reporting, and dashboard metrics lifted out of the verbose logs.
Not sure what I'll do with the verbose Bronze Layer logs (maybe cold-store them somewhere), but it's interesting to experiment with progressive aggregation of logs, hopefully purging and disposing of the raw log data as fast as we can extract value.
the OP wrote "This is a simple concept, but I've never seen it written down before." Putting on my tin-foil hat:
Maybe that's because predictable log volume isn't in the vendor's interest. Time-based logging makes usage easier to reason about. Bursty, count-based logs? Much harder to estimate, and much easier to monetize.
I wonder how do they log mission critical things in general. For instance, how often does a flight data recorder (FDR) log every state of mechanical components? Surely, they can't wait until something "interesting" to happen, right?
> I wonder how do they log mission critical things in general. For instance, how often does a flight data recorder (FDR) log every state of mechanical components? Surely, they can't wait until something "interesting" to happen, right?
There are different types of logging.
What you describe could be defined as an audit log intrinsic to system operation, which is quite a different thing than what the article describes.
No, please don't use logs to deduce whether your application is running.
Provide an endpoint which presents health information and use infrastructure-level metrics.
It should be both. A heartbeat monitor can log information at periodic events while warnings, errors and other high priority events are logged as soon as they occur. All log entries should be time-stamped regardless of the logging frequency.
I think "time based logging" is a bit of an anti-pattern, as that's LITERALLY a metric. The thrust of this is "sample at a consistent rate" which, yeah sample your numbers at a consistent rate. But also, probably don't use logs for this, probably use a metrics tracking system.
If you want to know if an application is running, implement health checks. I hope I never have to deal with the pattern suggested in this article in a production system.
Log interesting things, where interesting is defined as context outside what the "happy path" execution performs.
Collect and make available system metrics, such as invocation counts, processing time histograms, etc., to make available what the post uses log statements to disseminate same.
In the end, what's the difference between a log and a metric? Is one structured, and one unstructured? Is one a giant blob of text, and the other stored in a time series db? At the moment I guess I'm "logging my metrics" with structured logs going into Loki which can then unwrap and plot things.
You and the other commenters have given me the vocabulary to dig more into this area on the internet though. Thanks!
A log entry should capture an event in time, for example: a person logging in, a failure, a record of a notable event occurring, etc. These should be written at the time they occur when possible, to minimise chance of loss and to minimise delay for any downstream systems that might consume the logs. Arguments for batching could easily be made for systems generating very high volumes of logs.
Conversely, a metric is a single value, point-in-time capture of the size of something, measured in units or with a dimension. For example: current queue depth, number of records processed per second, data transfer rate in MB/s, cpu consumption percentage, etc. These can/should be written periodically, as mentioned in TFA.
> In the end, what's the difference between a log and a metric?
Don't let me put you down, but writing a logging advisory blog post when you don't know the difference between a log and a metric seems like a peculiar thing to do.
But I'm not shaming lack of knowledge, we all had to learn somehow.
Very few people do that though.
Essentially, a log entry is the emission of state known by an individual code execution path at the point the log entry can be produced, whereas a metric is a measurement of a specific runtime execution performed by the system.
For example, a log entry of:
Emits a log entry capturing the processing state known when the statement is evaluated. What it does not do is separate this information (a time-based attribute in this case) from other log entries, such as "malformed event detected" or "database connection failed."More importantly, putting metrics into log entries forces timing to include log I/O, requires metrics analysis systems to parse all log entries, and limits the type of metrics which can be reported to be those expressible in a message text field.
Maybe most important of all, however, is that metrics collection and reporting is orthogonal to logging. So in the example above, if the log level were set to "error", then there would be no log-based metric emitted.
> What it does not do is separate this information
Logging at scale should really be structured, which means that you can trivially differentiate between different types of log message. You also get more dimensions all represented in that structure.
> limits the type of metrics which can be reported to be those expressible in a message text field
This is another example, logging shouldn't be text based ideally. You might have a summary human readable field, but metrics can easily be attributes on the log message.
The more I work in this area the more I'm realising that logs and metrics are pretty interchangeable. There are trade-offs for each absolutely, but you can convert logs into metrics easily (Datadog does this), and with a bit more effort you could turn a metric into logs if you wanted to (querying metrics as rows in a SQL database is handy!).
Metrics collection is also not necessarily orthogonal to logging, it depends on your system. From a server, you might have logs pushed to an external source and metrics pulled from the server by Prometheus, but that's just implementation details. You can also have logs pulled from log files, and metrics pushed to a statsd endpoint.
I've worked on mobile apps where metrics get aggregated locally and then pushed as log events to the server with one log event per metric and dimension set, only for the server to then typically turn them back into metrics.
It's good to understand the tradeoffs, the technology, whether you're using push or pull, where data is spooled or aggregated, data costs, etc. But this stuff is all pretty malleable and there's often no clearly right answer.
In my local fun projects that run on my machine I might dump metrics into the logs because it's practical, but it doesn't make it "right".
Various companies, both in-house for big tech and then making this more widely accessible, started to answer this question by saying "pump all your individual logs in structured form into a giant columnar database that can handle nearly arbitrary numbers of columns, and we'll handle letting you slice and dice metrics out of any combination of columns you want. And if you have an ID follow the session around between different microservices, and maybe even all the way to the browser session, you can track the entire distributed system."
Different people might say that Datadog, Honeycomb, or Clickhouse (and the various startups backed by Clickhouse as a database) were the ones to make this pattern mainstream, and all of them pushed the boundaries in one way or another - nowadays, there's a whole https://opentelemetry.io/ standard, and if you emit according to that, you can plug in various sinks made by various startups, and choose the metrics UX that makes the most sense for your use case.
I'm a huge fan of Honeycomb - when I know a certain issue is happening, I can immediately see a chart showing latencies and frequencies, and click any hot spot to filter out the individual traces that exhibit the behavior and trace the end-to-end user journey, with all the different logs from all the systems touched by that request. And I can even begin this discovery from a single bug report by a single user whose ID I know. It's not just metrics - it's operational support. And if I'd pre-aggregated logs, I'd have none of this.
But of course, there are systems where this doesn't make sense! Large batch jobs, high-performance systems with orders of magnitudes more events than a standard web application... it's not one size fits all. That said, I think knowing about modern observability should be part of every developer's toolkit.
Usecases:
Log: search, get context, read
Metric: measure, plot dashboards, define alerts
My theory for the concepts being so mixed up together: you use both to troubleshoot, and I think the old school way to emit metrics was to parse logs and turn that into measures.
There are a few ways to slice this, but one is that logs are human-readable print statements and are often per-task. E.g. if you have 100 machines, you don't want to co-mingle their logs because that will make it harder to debug a failure. Metrics are statistics and are often aggregated across tasks. But there are also per-task metrics like cpu usage, io usage etc.
They can both be structured to some extent. Often storage strategies might differ but not necessarily. I think at Google the evolution of structured logging was probably something like (1) printf some stuff, (2) build tooling to scrape and combine the logs, (3) we're good at searching, but searching would be easier if we just logged some protos.
I think logs are basically self-explanatory since everything logs. To understand why you would want separate metrics, consider computing the average cpu utilization for your app across a fleet of machines. You don't want to do that by printf the CPU usage, grep-ing all the logs, etc. You could try to do that with structured logs, and I'm sure some structured logs SaSS companies would advocate that.
If you're new to this space, I really liked the book Designing Data-Intensive Applications.
There's also tracing
If you need to "log" something to give users feedback as the system is running, it may be less of a log and more of a progress or status output.
Logs to me are things which happen and I want to be able to trace later, so summarizing or otherwise dropping logs that come in quickly in succession would be a problem. If I need to filter I pipe to grep, otherwise I can just save it all and read through it later.
Status messaging, which may be informative about your process is useful, and if its goal is to be observed real-time, then yea. A message or two a second seems like a good goal for consistency.
These are just two very different use cases to me. And generally I find the former critical to get right, while the later may be nice to have and may lead to discovery by nature of making it more accessible.
In essence:
Logs mark some event in the system.
Metrics model some measurable, quantifiable state.
In high volume systems both can then be observed through various sampling techniques. A key item is that sampling is good to handle separately to application logic creating those signals as it may change over time or be dynamic.
> The moment of capturing a measurement is known as a metric event
Which suspiciously reads like a log.
In practice, a metric is an aggregate of events (the "metric events") when you're not interested in the individual event but, but in the aggregate itself. For practical reasons this is not implemented with logs but with more primitive technical events emission.
This is not fundamentally incompatible notions. If you do an electrocardiogram, you might be interested in your BPM, but it is deduced by the full log of each beat. The segregation we do in computing is more practical than fundamental.
Logs should be bursty, because they're most useful when debugging rare issues. If you have identical log lines, then that should have been a metric instead.
Metrics should be sampled based on frequency, because they deduplicate. I'm a huge fan of logarithmically sampling metrics.
One of the huge missing things in metrics systems, imho, is keeping granular metrics in the context of a business operation and then using late aggregation for trends. Last I looked nearly every metrics systems either logged individual events and and required processing for any rollup or aggregated too early and you couldn’t determine the effect on any individual operation/request. There’s a happy medium where you can get per-request counts, stats, and timing and still roll those up at the host/data center/region/granularity to get higher level trends.
Most metrics APIs are incompatible with this idea, however.
It's good to save metrics for things that remain true under arbitrary aggregations. E.g. sum, count, maximum and avoid things that do not survive aggregations such as percentiles.
This isn't as much "conflating" as it is constructing an ad hoc metrics subsystem that exports the metrics to the logs.
There's no theoretical difference between exposing a prometheus endpoint that's scraped every x seconds and printing the same data to the logs every x seconds.
You can worry about data retention, rollups, and other strategies for limiting data storage separately from the systems that emit the data.
At least with the right data stores. I kind of like what opensearch and elasticsearch do for this. In Elasticsearch you have a data stream. You configure it to roll over based on time or data size. Once rolled over, indices are read only; new data appends to the current one. You then can define life cycle policies to decide what to do with the old ones and e.g. move them to cold storage, transform them with rollups, and eventually delete them.
With application logging, you typically assign different log levels. Trace and debug are typically disabled in production (or should be). Info can be quite noisy. Warn tends to be repetitive (because developers tend to ignore warnings and will never fix them). Errors should be rare.
I have my system configured to start emailing me if errors get logged. An error means something is broken and needs to be fixed. Zero tolerance on errors. When an error happens, all the other log information provides me context. So there's value in retaining that. But only for a few days at best. Long enough to survive a weekend or things like Christmas. But after that it's just noise. I have a hard cut at about two weeks. Some places you need to store stuff longer for ass coverage reasons.
Data retention comes at a price of course. I've seen companies log ginormous amounts of data and ignoring all their errors. 30GB per day. Absolutely appalling. Me: it looks like your database layer is erroring non stop (constraint violations and worse); you might want to do something about that. Them, ah no that's just normal we just ignore it (php shop, incompetence was the norm). Me: so how do you know when something breaks?! Them: ......?!
My well paid consulting gig was beating some sense into this operation as one of the managers noticed they were spending hundreds of thousands per year on this nonsense. My fee was a rounding error on that. Easiest job ever. But kind of cringe worthy once I started looking into what they were actually doing and why. Mostly it's just, "yeah some guy set that up once and then we never looked at it and he left. What are you going to do?!". There was a lot of that with this company. Just absolutely nobody that even cared about the waste of resources or getting any meaningful feedback from their logging. If that's your team, you need to do something about it. That's your job and your not doing it well. If you need an external consultant to tell you, you might want to reflect on the notion of majorly shaking things up a bit.
Metrics should be emitted in separate stream and never by logs outside corner cases. Logs should be used to determine WHY the system is having issues but never IS the system having issues.
Log alerting is a fools errand that looks like a great idea at start but quickly becomes a sand trap that will drive future people crazy and at scale, will overwhelm systems.
Why is log alerting bad idea?
Every log becomes a metric point that must be dealt with. Therefore, the logging system must be kept operational and error free. However, due to other problems below, this system quickly becomes a beast of it's own.
Logs are generally much bigger then KV of <Metric> <Value> so there ends up being a ton of filtering going on in logging system, adding to the load.
Logging system probably does not understand rates so you end up writing gnarly queries to be like "Is this first unhandled exception?" in 10m or my 50th in 10m. Query in Prometheus is much much simpler.
Each language logging library handles things in different way so organization must be on point to either A) Keep log format the same between all different languages. B) Teach the logging system how to manipulate each log into format that can be handled by alerting system. Obviously A causes massive developer friction and B causes massive Ops friction.
Finally, I find people doing logging tend not handle exceptions as well because they can just trust logging system to alert them on specific problem and deal with it manually.
So for future Ops person who has to deal with your code, I'm begging you, import prometheus_client.
Where exactly does this anti-logs sentiment come from? Is it because tools like datadog can be lackluster for reading logs across bunches of hosts?
If you have good metrics, you can generally get much further not even logging aggregating outside tossing everything into STDOUT and checking on it when you have alerts.
It's claiming that you should output working logs, metrics, and failure logs into different streams.
A simple fixed-length rolling buffer can get you pretty far for logging, but it isn't something you necessarily want to get off-device except when something bad has happened.
I imagine many people learn on an environment like this and get thrown in a high volume one without chance to adapt.
The only one I'll add is: If your logs are usually read in a log aggregator like Splunk or Grafana instead of in a console or text file, log as JSON objects instead of lines of text. It makes searches easier.
For Python users, there's a "logfmter" package which is enormously more straightforward than the popular "structlog" one.
What we really need is smart logging: only log the full span when an error is detected, otherwise no need for it. But it's not a very well supported case.
That, combined with a real, genuine devops scheme where the people implementing the system and the people keeping it running in production where on the same team and generally the same actual people, seemed to produce some of the most excellent and usable logging I've ever seen. Without needing a whole bunch of rules to try and force everyone to (still fail to) get there.
One neat thing that I think really facilitated this was the sense of empowerment that came with having exactly one rule (logging isn't configurable) combined with one goal (keep the system up). We did decide we wanted smart logging along those lines. And we did see that existing solutions didn't support this very well. So someone wrote one. And it was so dead simple, and easy to use. The 'user manual' for new hires was basically, "Here's 50 lines of code that you should read."
I wish we could all be so lucky as to only care "when an error is detected." Logging is about creating breadcrumbs that can be searched and cross-referenced to piece together what happened when no error is detected, but the behavior is nevertheless suspect.
That's clever and I'll definitely use it in the future.
It’s kind of unfortunate, because for example there’d be pushback against logging branches in code etc., except for trace logs (that others wouldn’t add) that are also off most of the time when problems actually happen. It does help a lot in personal projects though, albeit the limited traffic there kinda minimizes any problems that ample logging might otherwise cause.
At least it’s possible to move in the direction of adding some APM like GlitchTip or Skywalking.
The problem with that is that you're now required to reproduce the issue after turning on the logging, and if you already have a reproducer, why not just attach a real debugger?
The overlap of "we can reproduce" but "it has to run on the production server" ends up being practically zero.
Probably due to no direct access to the environments, whereas there's less organizational resistance to toggling logging levels... assuming the issue is even easily reproducible, instead of "Okay, this happened once and cost us $X, it was pretty horrible, we don't know exactly why it happened, but fix it before it happens again."
(not an exact case, but a pattern I've seen)
Or logfmt, which is easier for humans to read, lower overhead, and still structured and supported for parsing and queries in at least Grafana/Loki.
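The format itself is just key=value pairs; a naive emitter (illustrative, not the actual logfmter package) fits in a few lines:

    def logfmt(**fields):
        # key=value pairs; quote values that contain spaces.
        parts = []
        for key, value in fields.items():
            text = str(value)
            parts.append(f'{key}="{text}"' if " " in text else f"{key}={text}")
        return " ".join(parts)

    print(logfmt(level="info", msg="batch processed", count=7))
    # -> level=info msg="batch processed" count=7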
In this particular example, I agree with others: this is a case for metrics. "Log errors, metric successes[0]."
0: success events (typically a bit more than a plain log line) may be important, especially if tied to something you charge for.
From the log consumer (person) perspective, you'd want logs to provide you with sufficient information when troubleshooting. But since trouble usually happens when things go wrong in unexpected ways, the logging likely won't be well aligned to emit the right info for you to figure out what's going wrong exactly. What then, are you supposed to log the entire application state and every change to it? But then that's way too expensive, and there's a decent chance you might just drown in the noise instead. So you're left with this half artform half science type deal.
One thing I'm grateful for is that over the years most everything now logs in JSON lines at least. I just wish there was a standardized, simple way to access all the possible kinds of JSON objects that might be emitted into the logs. A schema would be a good start, but then I can immediately see ways how that would be quickly rendered lot less useful early on (e.g. "this and that field can contain some other serialized JSON object, good luck!").
Metrics handle too many events by aggregating them: you squash them into a smaller number of values that summarize the information.
Logging handles too many events by sampling them. If you have N times as many events as you can handle, take 1 in N of them or whatever other sampling model you want.
Tracing is logging, but where you have chains of correlated events. If you have a request started and a request ended event, it is pretty useless to get one without the other. So, you sample at the "chain of correlated events" level. You want 1 in N "chains of correlated events".
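A sketch of chain-level sampling (the rate and the trace structure are made up for illustration):

    import random

    SAMPLE_RATE = 0.01  # keep roughly 1 in 100 chains; pick N for your volume

    def start_trace():
        # One sampling decision per chain, made up front, so a
        # "request started" event is never kept without its "ended" twin.
        return {"sampled": random.random() < SAMPLE_RATE, "events": []}

    def record(trace, event):
        if trace["sampled"]:
            trace["events"].append(event)

    trace = start_trace()
    record(trace, "request started")
    record(trace, "request ended")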
But, if you have enough throughput for all your events, just get yourself a big pile of events and throw it into a visualizer. Or better yet, just enable time-travel debugging tracing so you do not even need to figure out how the events map to your program state.
Somewhat formalized: https://peter.bourgon.org/blog/2018/08/22/observability-sign...
For replayability/state reconstruction, it's usually enough to log the input data and the decisions made upon them, i.e. which branches of the if/switch (and things morally equivalent to them, e.g. virtual functions and short-circuiting Boolean operators) you actually took.
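As a toy illustration of logging inputs plus branch decisions (all names made up):

    def route_order(order, journal):
        # Record the input and each branch taken; together these are
        # usually enough to reconstruct the decision path afterwards.
        journal.append(("input", order))
        if order["amount"] > 1000:
            journal.append(("branch", "large_order"))
            return "manual_review"
        journal.append(("branch", "small_order"))
        return "auto_approve"

    journal = []
    route_order({"amount": 1500}, journal)
    # journal: [("input", {"amount": 1500}), ("branch", "large_order")]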
> But then that's way too expensive,
Yes, it's usually still way too expensive. But when it's not, it does give you information about at what code point exactly the "wrong" decision was made, and from there you can at least start thinking about how the system could get into the state where it would start making "wrong" decisions at this precise point of code — and that usually cuts down the number of possible reasons tremendously.
While not an industry standard, ECS[0] is a commonly used open source specification for JSON log entries. There are others, but this one can serve a system well IMHO.
0 - https://www.elastic.co/docs/reference/ecs/ecs-guidelines
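For flavor, an ECS-shaped entry might look roughly like this (the values are invented; see the guidelines for the canonical field set):

    import datetime
    import json

    entry = {
        "@timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "log": {"level": "error"},
        "message": "database connection failed",
        "service": {"name": "billing-api"},  # illustrative service name
    }
    print(json.dumps(entry))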
If you log by count, you need a global counter for that event (you could do thread-local, but then your logging volume would depend on the number of threads). If the code path is hot (which may be the case if you want to throttle your logs) multiple threads will contend on the increment, and that can be very expensive.
If you log by time, you just need a load and a clock read (on Linux, `CLOCK_MONOTONIC_COARSE` is a handful of ns and the resolution is enough for this purpose), and only need synchronization (a compare-and-swap) when the timer expires, so threads virtually never interfere with each other.
It looks something like this (pseudocode):
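(A sketch in Python rather than true pseudocode; a non-blocking lock acquire stands in for the compare-and-swap, and time.monotonic() for the coarse clock.)

    import threading
    import time

    class ThrottledLogger:
        def __init__(self, period_s=1.0):
            self.period = period_s
            self.deadline = 0.0              # next time we're allowed to log
            self._slot = threading.Lock()    # stands in for the CAS

        def maybe_log(self, msg):
            now = time.monotonic()           # cheap clock read
            if now < self.deadline:
                return                       # fast path: a load and a compare
            if self._slot.acquire(blocking=False):  # one winner per expiry
                try:
                    if now >= self.deadline:         # re-check after winning
                        self.deadline = now + self.period
                        print(msg)           # or hand off to the real logger
                finally:
                    self._slot.release()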
But why? Often the purpose is just to log a "been here" signal with some additional details for diagnostics. You don't need to include an accumulation of everything that happened since the last log. All you care about is that the log happens at most once per period, say once per second.
If you do want to also log some data that accumulates everything that happened, you can accumulate the data in thread-local buffers, and in the "actually log" part collect all the buffers and log them. Since this only happens in the thread that "wins" the CAS, it is still very scalable. This is a very common technique.
If you throttle by count, you cannot avoid the contended atomic increment (you can with some sophistication and at the cost of some approximation).
> collect all the buffers
Which requires some sort of synchronization (or lock-free data structures), because of concurrent writes by other threads. In that situation, you can also simply use a dedicated thread to periodically flush the log buffers.
Besides the (potential) bug, it is a cool idea, if emitting metrics is not an option.
The human who is debugging an issue can see when we started, see that some processed successfully, see regular progress through the batch, then see that the 58th percentile batch hasn’t completed and that’ll be where the problem is.
The main benefit over time-based logging is that the code is much simpler, and the log output is simpler too.
There are even libraries like tqdm that do this for you in one line of code.
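e.g., assuming the Python library (whose display updates are themselves time-throttled via mininterval):

    from tqdm import tqdm

    # Progress display refreshes at most about once per second here,
    # regardless of how fast items are processed.
    for item in tqdm(range(1_000_000), mininterval=1.0):
        pass  # process(item)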
In my experience, this post is often right (and the logs are often wrong). There's a tendency to either log too much or log too little - if only a few items are getting processed, it's fine and maybe even good to log all 7 of them.
But if many, many are getting processed, you'll experience semantic overload as a reader of the logs. What you want is a compressed form.
Logging per time interval can be a very handy approach. In my work, we've settled on a hybrid approach - calculate in real time how often things are happening and then log the number of things that have happened, but at a rate that is roughly one log every N seconds.
This takes some more engineering up front but is remarkably often what a log reader actually wants.
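A rough sketch of that hybrid, assuming a single-threaded counter is good enough:

    import time

    class RateLogger:
        """Count every event, but emit roughly one log line every N seconds."""
        def __init__(self, period_s=5.0):
            self.period = period_s
            self.count = 0
            self.last = time.monotonic()

        def tick(self):  # call once per processed event
            self.count += 1
            now = time.monotonic()
            if now - self.last >= self.period:
                elapsed = now - self.last
                print(f"{self.count} events in {elapsed:.1f}s "
                      f"({self.count / elapsed:.0f}/s)")
                self.count, self.last = 0, now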
Metrics enable you to aggregate events into some kind of meaning.
That meaning can then have alerts associated with it.
You cannot create metrics on things you don’t know, which is why logging is the base.
I cannot stress enough the importance of understanding atomic movements.
The cost is high but not as high as the cost of not knowing.
We don't log just to have records of everything; we log to answer future questions.
I'm working on a system that generates huge numbers of log entries, and I have settled on a short-term solution of over-logging everything.
Once a log entry has been persisted, I use it as a 'Bronze Layer' in a typical Medallion Model and then filter that log data up into Silver and Gold layers, so that billing, reporting, and dashboard metrics can be lifted out of the verbose logs.
I'm not sure what I'll do with the verbose Bronze Layer logs (maybe cold-store them somewhere), but it's interesting to experiment with progressive aggregation of logs, hopefully purging and disposing of the raw log data as fast as we can extract value from it.
Maybe that's because predictable log volume isn’t in the vendor’s interest. Time-based logging makes usage easier to reason about. Bursty, count-based logs? Much harder to estimate—much easier to monetize.
There are different types of logging.
What you describe could be defined as an audit log intrinsic to system operation, which is quite a different thing than what the article describes.