Stop Forwarding Errors, Start Designing Them

(fast.github.io)

90 points | by andylokandy 19 hours ago

17 comments

Rygian 17 hours ago
Sorry for the small digression. It's on topic.
Just a few minutes ago, while copying 63 GB worth of pics and videos from my phone to my laptop, KDE forwarded me the error "File <hard to retain name.jpg> could not be opened. Retry, Ignore, Ignore all, Cancel".
This was around file 7000 out of 15000. The file transfer stopped until I made a choice.
As a user, what am I supposed to do with such a popup?
It seems like a very good example of "Error Handling Without Purpose" as the article describes, but at user level.
Except that here, the audience is "a plain user who just dragged a folder to make a copy" and none of the four options (or even the act of stopping the file transfer until an answer is chosen) is actually meaningful for the user.
The "Putting It Together" for this scenario should look like: a non-modal section populates with "file <hard to retain name.jpg> failed due to reason; at the end of the file transfer you'll get a list with all the files that failed, and you'll have an option to retry them, navigate to their source position to double-check, and/or ignore".
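A minimal Rust sketch of that behaviour, assuming a flat list of source files and a hypothetical `copy_all` helper (the `CopyFailure` and `CopyReport` types are made up for illustration and are not KDE's code): every failure is recorded and the loop keeps going, so the UI can show one list at the end instead of a modal per file.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// One file that could not be copied, with the reason kept for the final report.
#[derive(Debug)]
pub struct CopyFailure {
    pub source: PathBuf,
    pub kind: io::ErrorKind,
    pub message: String,
}

/// Summary handed to the UI once the whole transfer has finished.
#[derive(Debug, Default)]
pub struct CopyReport {
    pub copied: Vec<PathBuf>,
    pub failed: Vec<CopyFailure>,
}

/// Copies every file in `sources` into `dest_dir` and never stops on a single failure.
pub fn copy_all(sources: &[PathBuf], dest_dir: &Path) -> CopyReport {
    let mut report = CopyReport::default();
    for src in sources {
        let result = match src.file_name() {
            Some(name) => fs::copy(src, dest_dir.join(name)).map(|_| ()),
            None => Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "source has no file name",
            )),
        };
        match result {
            Ok(()) => report.copied.push(src.clone()),
            // Record the failure and move on to the next file.
            Err(e) => report.failed.push(CopyFailure {
                source: src.clone(),
                kind: e.kind(),
                message: e.to_string(),
            }),
        }
    }
    report
}
```

The `failed` list is what a "retry these / show in folder / ignore" panel would be built from; persisting it to disk would be a separate step.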
- grumbel 5 hours ago
  > As a user, what am I supposed to do with such a popup?
  Change the floppy disk. In the MSDOS days those messages were useful, as read errors might be caused by having the wrong floppy in the drive. The OS had no way to know when the floppy was changed and "Retry" allowed you to swap the disks back and try again. In modern days it is less useful, the behavior just got carried over.
  Windows addresses this issue somewhat by scanning the directory tree before the actual copying starts, this can catch some errors before they happen and gives you better progress reporting on top.
  But a single dialog that keeps track of the whole copy/move operation, not a modal dialog attached to individual read/write calls, would be the way to go here. This is a case of the GUI sticking too close to what the OS is doing instead of what the user intended to do.
  - 1718627440 53 minutes ago
    > Windows addresses this issue somewhat by scanning the directory tree before the actual copying starts
    Which really sucks because now you need to wait for minutes before it actually starts moving or deleting. I generally just abort, start the midnight commander or just invoke mv/del directly.
    > But a single dialog that keeps track of the whole copy/move operations
    Which is what is the case here? The question and buttons appear in that dialog.
    - grumbel 25 minutes ago
      > The question and buttons appear in that dialog.
      The error/retry dialog is for the failure of moving an individual file, not for a failure of the move operation as a whole. Those individual error dialogs provide no means to deal with cascading errors. All you can do is "Skip All", but that means you get no further information on errors anymore.
      The error reporting should be part of the Moving dialog itself and provide a list of everything that failed in the move, along with potential ways to resolve it. More detailed reporting than "Could not read" would also be welcome (io, permission, ...).
- XorNot 17 hours ago
  This design still doesn't work: what if the user walks away and the computer is powered off in the meantime?
  I.e. you need to write the report of this to a file itself. In fact you should allocate a decently large file upfront to make sure you can write the report and the error message (out of disk space for example).
  - Rygian 17 hours ago
    It goes quite far, actually.
    A file transfer should remain active even if both devices (source, destination) are physically disconnected, or in network partitions, or when devices are full, need media change, etc.
    The only valid states for a file transfer are: ongoing, fully completed with 100% success, or explicitly cancelled by the user with a full usable report of what got copied, fully or partially, and what did not get copied.
    The file transfer dialogs and tooling of today's mainstream computing are stuck in the nineties.
    - yetihehe 5 hours ago
      Then you will have another control panel or log of ongoing file transfers, which will accumulate waiting transfers over the years a device was used.
  - throw-the-towel 17 hours ago
    And what if the computer is kidnapped by the US Army while it's copying the files?
    You just can't defend against everything, but an imperfect solution can still be an improvement over the status quo.
    - vineyardmike 13 hours ago
      > kidnapped by the US Army.... You just can't defend against everything
      Of course not.
      The litmus test IMO should be "what would a normal intelligent human do in this situation?"
      A human would copy every file it could, maintaining a list of issues. When you were available to address concerns, it'd present the options to you. The human would give up if the US Army showed up, but a human would restart a TCP connection automatically without asking for permission again (or more analogously, redial a phone call). A human would save their work automatically, and when you showed back up, would find that work for you.
      (In 2026, things like "retry" should be automatic outside some very specific limitations too, because of course a human would try again if they failed).
      - 1718627440 49 minutes ago
        > A human would copy every file it could, maintaining a list of issues.
        Please not, I want my computer to be a dumb tool that really only does what I told it to. I do not want it to have its own agenda.
        > In 2026, things like "retry" should be automatic outside some very specific limitations too
        No. I can tell the computer to retry; when I didn't, it is because I didn't want it to.
      - yetihehe 5 hours ago
        > what would a normal intelligent human do in this situation?
        Problem is that this requires testing what an actual "normal intelligent human" would do, because very often the programmer has other ideas and UI/UX people have other ideas.
        > A human would copy every file it could, maintaining a list of issues.
        How do you know? From your idea what should be done instead of current version? I would not do it like you said.
        Also, there are many reasons for a transfer not succeeding, and depending on the reason, you should make different decisions. Sometimes reasons are not predictable by a program (a new file transfer method over pigeons was transparently added to the system and "carrier attacked by predator" was not included in "how to handle this reason").
    - XorNot 16 hours ago
      No, but imagine doing all the work to collect up a list of files that failed only to say, pop a modal at the end of the process that coincides with the user hitting Enter because they were multitasking and it auto-accepts the dialog. Information gone, context lost, in fact your entire design has failed to change the experience at all! All because of one UI overlap that's actually very common.
      We have shared workstations for example where this would be a typical use case for non-technical users across multiple user logins: ensuring you can check that the big data transfer was complete a few hours later would be very useful, but if you only do a fraction of the work for completeness then again, it's of no benefit.
      - marcosdumay 10 hours ago
        Yes. The entire reason DEs expect people to dismiss those dialogs is because they are modal. And there's no reason at all for them to be modal.
        KDE even got an entire notifications application, and discovered that it's bad to make them modal. But didn't move away from the idea of dismissing them on any interaction, it still acts like it's a modal.
bccdee 15 hours ago
I'm not sure I like how they're trying to dynamically cast to an error type.
```
  Err(report) => {
      // For machines: find and handle the structured error
      if let Some(err) = find_error::<StorageError>(&report) {
          if err.status == ErrorStatus::Temporary {
              return queue_for_retry(report);
          }
          return Err(map_to_http_status(err.kind));
      }
```
They get it right elsewhere when they describe errors for machines as being "flat and actionable." `StorageError` is that, but the outer `Err(report)` is not. You shouldn't be guessing which types of error you might run into; you should be exhaustively enumerating them.
I'd rather have something like this:
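```
  use std::panic::Location;

  // Assumed minimal trace type: one (location, message) frame per layer.
  struct Trace(Vec<(&'static Location<'static>, String)>);

  impl Trace {
      fn add_context(mut self, at: &'static Location<'static>, msg: String) -> Trace {
          self.0.push((at, msg));
          self
      }
  }

  struct Exn<T> {
      trace: Trace,
      err: T,
  }

  impl<T> Exn<T> {
      // Converts the structured error to the caller's type and records the call site.
      #[track_caller]
      fn wrap<U: From<T>>(self, msg: String) -> Exn<U> {
          Exn {
              trace: self.trace.add_context(Location::caller(), msg),
              err: self.err.into(),
          }
      }
  }
```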
That way your `err` field is always a structured error, but you still get a context trace. With a bit more tweaking, you can make the trace tree-shaped rather than linear, too, if you want.
I think actionable error types need to be exhaustively matchable, at least for any Rust error that you expect a machine to be handling. Details a human is interested in can be preserved at each layer by the trace, while details the machine cares about will be pruned and reinterpreted at every layer, so the machine-readable info is kept flat, relevant, and matchable.
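To illustrate the exhaustive-matching point without the trace machinery, here is a small sketch. The variants of `StorageError`, the `ApiError` enum, and the `load`, `handle`, and `http_status` functions are all hypothetical, not taken from the article or from exn. Each layer converts the lower error through `From` with no wildcard arm, so adding a variant becomes a compile error rather than a guess at runtime.

```rust
/// Low-level, machine-facing error: a flat enum a caller can match on.
#[derive(Debug, PartialEq)]
enum StorageError {
    Timeout,
    NotFound,
    Corrupted,
}

/// Error of the layer above; storage details are reinterpreted, not forwarded.
#[derive(Debug, PartialEq)]
enum ApiError {
    RetryLater,
    NoSuchItem,
    Internal,
}

impl From<StorageError> for ApiError {
    fn from(e: StorageError) -> Self {
        // No wildcard arm: a new StorageError variant fails to compile here.
        match e {
            StorageError::Timeout => ApiError::RetryLater,
            StorageError::NotFound => ApiError::NoSuchItem,
            StorageError::Corrupted => ApiError::Internal,
        }
    }
}

/// The outermost layer also matches exhaustively.
fn http_status(e: &ApiError) -> u16 {
    match e {
        ApiError::RetryLater => 503,
        ApiError::NoSuchItem => 404,
        ApiError::Internal => 500,
    }
}

/// Stand-in for a storage call; the keys are arbitrary test inputs.
fn load(key: &str) -> Result<String, StorageError> {
    match key {
        "slow" => Err(StorageError::Timeout),
        "bad" => Err(StorageError::Corrupted),
        "missing" => Err(StorageError::NotFound),
        _ => Ok(format!("value of {key}")),
    }
}

fn handle(key: &str) -> Result<String, ApiError> {
    // `?` applies the From impl above, so no error type is guessed at runtime.
    Ok(load(key)?)
}
```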
- andylokandy 10 hours ago
  `Exn<T>` preserves the outermost error type and `Exn::<T>::as_error()` will give you the error just the way you want.
  Traversing through the error tree is the worst case, where the structured error has bubbled up through layers until it reaches the one that is able to recover from it.
dvogel 17 hours ago
> But as a standard library abstraction, it’s too opinionated. It categorically excludes cases where sources form a tree: a validation error with multiple field failures, a timeout with partial results. These scenarios exist, and the standard trait offers no way to represent them.
This seems akin to complaining that the CPU core has only one instruction pointer. There is nothing preventing a struct implementing `Error` from aggregating other errors (such as validation results) and still exposing them via the `Error` trait. The fact of the matter is that the call stack is linear, so the interior node in the tree the author wants still needs to provide the aggregate error reporting that reflects the call stack that was lost with the various returns. Nothing about that error type implementing `Error` prevents it from also implementing another error reporting trait that reflects the aggregate errors in all of the underlying richness with which they were collected.
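As a rough sketch of that point, assuming hypothetical `FieldError` and `ValidationError` types: the aggregate implements `std::error::Error`, exposes one child through `source()` for generic reporters, and offers its own `failures()` accessor for callers that want every child.

```rust
use std::error::Error;
use std::fmt;

/// A single failed field (illustrative type).
#[derive(Debug)]
struct FieldError {
    field: String,
    reason: String,
}

impl fmt::Display for FieldError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.field, self.reason)
    }
}

impl Error for FieldError {}

/// Aggregate error holding every field failure.
#[derive(Debug)]
struct ValidationError {
    failures: Vec<FieldError>,
}

impl ValidationError {
    /// Full view of the children, outside the linear `source()` chain.
    fn failures(&self) -> &[FieldError] {
        &self.failures
    }
}

impl fmt::Display for ValidationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} field(s) failed validation", self.failures.len())
    }
}

impl Error for ValidationError {
    /// The standard trait can expose only one source; the first failure is used here.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.failures.first().map(|e| e as &(dyn Error + 'static))
    }
}
```

Which child to surface through `source()` is a design choice; returning `None` and relying only on `failures()` would be equally valid.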
oncallthrow 17 hours ago
This is interestingly somewhere where Go really shines, in my experience. Go has no requirement to wrap (or, indeed, even handle at all) errors; yet, despite this, Go codebases I've worked in almost always perform error handling properly (wrapping at each layer of the call stack, so it's easy to identify where an error occurred).
- spion 13 hours ago
  I don't think there is anything in Go (the language) that helps achieve this - it's mostly cultural. (Go creators and community being very outspoken about handling errors).
  In fact, the easiest thing to do in Go is to ignore the error; the next easiest is to early-return the same error with no additional context.
  Technically speaking, Rust has way better tools for adding context to errors. See for example https://docs.rs/color-eyre/latest/color_eyre/
  It does expect you to use `wrap_err` to get the benefits, though. Which is easier to do than what Go requires you to do for good contextual errors, and even easier if you want reasonable-looking formatting from the Go version.
  - spion 13 hours ago
    IMO you need both things: culture to make it happen, and technology to make it easy and reasonable looking. Rust lacks the former to some degree; Go lacks the latter to some degree (see e.g. kustomize error formatting - everything ends up on a single line)
- jiehong 15 hours ago
  For the flat structure part, it’s much less shiny, though.
  Weirdly, the last time I saw an error in production I couldn’t investigate was because of a go service with no error wrapping… funny coincidence
- hu3 13 hours ago
  It's about incentives. Go makes it explicit.
  And because it's standardised, it's easy to create tooling to flag mishandled errors.
- morshu9001 16 hours ago
  I'd rather have exceptions so this is done for you. Not really an option in Rust due to overhead ofc.
spion 13 hours ago
Great article. Really advances the thinking on error handling. Rust already has a head start compared to most other languages with Result, expect and anyhow (well, color_eyre and tracing), but there was indeed a missing piece tying together error handling "actionability" with "better than stack trace" context for the programmer.
With regards to context for the programmer, I still think ultimately tracing and color_eyre (see https://docs.rs/color-eyre/latest/color_eyre/) form a good-enough pair for service style applications, with tracing providing the missing additional context. But it's nice to see a simpler approach to actionability.
vaylian 18 hours ago
I've been thinking about Rust errors as well. We see all these nice tutorials that explain how you can match on an Err and then handle it. But I haven't seen this being done in practise. Most errors are reported directly to the user. There don't seem to be any attempts to automatically handle them.
The cause for an error can be upstream or downstream. If a function fails, because the network is down, then this is a downstream error. The user has not done anything wrong (unless they also are responsible for the network infrastructure). In that case a retry after a few moments might be the right approach. However, if the user provides bad function arguments, then the user needs to be informed, that it's them who need to make corrections. However, it is not always clear if that is the case. If a user requests a non-existing file, then there might be different reasons why the file does not exist (yet).
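A small sketch of how that distinction could drive automatic handling. The `Origin` enum, `RequestError`, and `with_retry` are hypothetical names, and the sleep between attempts is left out for brevity: downstream failures are retried, upstream ones go straight back to the caller.

```rust
/// Where the fault lies, in the sense used above (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Origin {
    /// The caller supplied bad input; retrying the same call cannot help.
    Upstream,
    /// Something the call depends on failed (network, disk); it may recover.
    Downstream,
}

#[derive(Debug, PartialEq)]
struct RequestError {
    origin: Origin,
    message: String,
}

/// Runs `op` up to `max_attempts` times, retrying only downstream failures.
fn with_retry<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, RequestError>,
) -> Result<T, RequestError> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if e.origin == Origin::Downstream && attempt < max_attempts => {
                attempt += 1; // a real client would wait or back off here
            }
            // Upstream errors, or downstream errors after the last attempt.
            Err(e) => return Err(e),
        }
    }
}
```

The ambiguous case from the paragraph above (a file that does not exist yet) is exactly where the classification has to be decided by whoever constructs the error.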
- rileymat2 18 hours ago
  I am a bit confused by the network example, even when I don't control the network at the moment I need to do something about it and know about it to act.
  - vaylian 5 hours ago
    The software needs to report back to the end user eventually. But if there is a temporary network failure, then the software should automatically retry the request without informing the user (assuming idempotency).
Sytten 17 hours ago
Exn looks very interesting, but to be actionable we need a compatibility layer with thiserror and anyhow since most people are using them right now. Moving the goalposts a little, we mostly need a core Rust solution, otherwise your error handling stops at the first library you use that doesn't use exn.
- tison 12 hours ago
  I think they are almost compatible.
  `thiserror` helps you define the error type. That error type can then be used with `anyhow` or `exn`. Actually, we have been using thiserror + exn for a long time, and it works well. Later we realized that `struct ModuleError(String)` can easily implement Error without thiserror, so we removed the thiserror dependency for conciseness.
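  For reference, a hand-written implementation for such a newtype is only a few lines. This is a sketch using the standard library alone; the `open_config` function is a made-up caller, not part of exn.

```rust
  use std::error::Error;
  use std::fmt;

  /// Module-level error carrying only a human-readable message.
  #[derive(Debug)]
  struct ModuleError(String);

  impl fmt::Display for ModuleError {
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          f.write_str(&self.0)
      }
  }

  // Debug + Display are the only requirements; every trait method has a default.
  impl Error for ModuleError {}

  /// Hypothetical caller that returns the error behind a trait object.
  fn open_config() -> Result<(), Box<dyn Error>> {
      Err(ModuleError("config file is missing".to_string()).into())
  }
```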
  `exn` can use `anyhow::Error` as its inner Error. However, one may use `Exn::as_error` to retrieve the outermost error layer to populate anyhow.
  I once considered `impl std::error::Error` for `exn::Exn`, but it would lose some information, especially if the error has multiple children.
  `error-stack` did that at the cost of no more source:
  * https://docs.rs/error-stack/0.6.0/src/error_stack/report.rs....
  * https://docs.rs/error-stack/0.6.0/src/error_stack/error.rs.h...
jiehong 15 hours ago
I suppose Java exceptions have the same issues, albeit with automatic stack traces, obviously:
- the ? keyword is replaced either by runtime exceptions, so each function does it transparently if you don’t catch it, or by simply stating the raised exception in the signature
- message can be overloaded for humans
- the exception type itself is the structured data, but in practice it seldom contains structured data and most logic depends on the exception type.
Make of this what you will, but I didn’t say it’s great.
- imtringued 4 hours ago
  Java has nested exceptions, which significantly reduces the problem, since there is going to be at least one relevant exception that will help you figure it out. In the worst case you can just paste the stack trace into your GitHub issue and call it a day.
  With Rust, having a generic error bubble up without nesting means you don't even know where it went wrong. The error could be from any generic error source.
croemer 16 hours ago
Be warned: LLM writing. Lots of negative parallelisms.
- tison 12 hours ago
  This is the pull request of this post: https://github.com/fast/fast.github.io/pull/12
  See comments like https://github.com/fast/fast.github.io/pull/12#discussion_r2...
  Quote my comment in the other thread:
  > That said, exn benefits something from anyhow: https://github.com/fast/exn/pull/18, and we feed back our practices to error-stack where we come from: https://github.com/hashintel/hash/issues/667#issuecomment-33...
  > While I have my opinions on existing crates, I believe we can share experiences and finally converge on a common good solution, no matter who made it.
- amelius 14 hours ago
  Speaking of which, why aren't the LLMs solving these low level plumbing problems for us yet?
  - croemer 14 hours ago
    Because LLMs mostly follow historical practice. And examples for bad error handling are more common (and easier) than good error handling.
    - amelius 14 hours ago
      I'm pretty sure an LLM will be able to handle an instruction such as:
      "Wherever exceptions are thrown, add as much contextual information to the exceptions as possible. Use class RichException<Exception> to store the extra information". Etc. etc.
      - croemer 13 hours ago
        Sure, but writing and maintaining such instructions is also work. And not something one thinks about usually until the debugging session with insufficient errors.
- alienbaby 13 hours ago
  What is it you are actually warning me of?
  - croemer 13 hours ago
    That it is mostly LLM words which some of us here don't really like to read as it can be low entropy in language, structure, ideas.
- Lvl999Noob 14 hours ago
  Yeah. Certainly felt like that. On the other hand, the content does seem good. It definitely wasn't slop, even if I can't judge how useful it really was (in terms of giving a solution).

Thaxll 16 hours ago
Looks very similar to what Upspin (Go) errors look like:
https://github.com/upspin/upspin/blob/master/errors/errors.g...

    type Error struct {
        // Path is the Upspin path name of the item being accessed.
        Path upspin.PathName
        // User is the Upspin name of the user attempting the operation.
        User upspin.UserName
        // Op is the operation being performed, usually the name of the method
        // being invoked (Get, Put, etc.). It should not contain an at sign @.
        Op Op
        // Kind is the class of error, such as permission failure,
        // or "Other" if its class is unknown or irrelevant.
        Kind Kind
        // The underlying error that triggered this one, if any.
        Err error

        // Stack information; used only when the 'debug' build tag is set.
        stack
    }
nchagnet 15 hours ago
I really like the pattern presented in the article. I find myself guilty of designing errors which are useful to me, but maybe not to my user (which tbh in my area is always a bit of a nebulous entity). I really like the idea of separating those two intents, and to make explicit the possible action.
larusso 15 hours ago
Error handling in Rust is the number one frustration. I rewrote my errors multiple times. I used error_chain, which looked good on paper but was just as broken as thiserror and anyhow. The missing piece is that no one really defines how to write good and meaningful error types for the different audiences. Even the article described some cases that are highly implementation specific. I will take a look at this other crate the author showed, though. The thiserror crate makes it too easy to just forward errors with the `#[from]` / `#[source]` attributes. I played around with a helper crate that tries to add a context method to each generated error type. But this is optional as well and also adds tons of overhead.
yxhuvud 5 hours ago
Unreadable due to lag when scrolling. How do you even manage that? Stutters happen on other pages but this was just a delay that was extremely annoying.
fozem 17 hours ago
Good overview on Rust error handling.
I like errors that are unique and trivially greppable in a codebase. They should be stack efficient and word sized. Maybe a new calling convention where a register is reserved for error code and another register is a pointer to the source location string that is stored in a data segment.
The FP fanboy side of me likes the idea of algebraic effects and ADTs but not at the expense of stack efficiency.
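Part of this can be approximated in today's Rust, without a new calling convention. A sketch under stated assumptions: `ErrCode`, the two constants, and `write_block` are hypothetical, and the source-location pointer is left out. A nonzero integer code keeps the whole `Result` within a machine word, and each constant is a unique string to grep for.

```rust
use std::mem::size_of;
use std::num::NonZeroU16;

/// Hypothetical error code: one nonzero 16-bit value per failure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
struct ErrCode(NonZeroU16);

impl ErrCode {
    /// Builds a code at compile time; 0 is rejected because it means success.
    const fn new(code: u16) -> Self {
        match NonZeroU16::new(code) {
            Some(c) => ErrCode(c),
            None => panic!("error code 0 is reserved for success"),
        }
    }
}

// Unique, greppable names and values.
const E_DISK_FULL: ErrCode = ErrCode::new(0x0101);
const E_NO_PERMISSION: ErrCode = ErrCode::new(0x0102);

// Compile-time check that the whole Result fits in one machine word.
const _: () = assert!(size_of::<Result<(), ErrCode>>() <= size_of::<usize>());

/// Stand-in for a fallible operation that reports failures by code only.
fn write_block(free_blocks: u32, writable: bool) -> Result<(), ErrCode> {
    if !writable {
        Err(E_NO_PERMISSION)
    } else if free_blocks == 0 {
        Err(E_DISK_FULL)
    } else {
        Ok(())
    }
}
```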
- EPWN3D 17 hours ago
  You basically want a modern errno. I don't mean that as a dig at you -- I've found POSIX error codes to still be the best way to design errors in C. If it can't be evaluated by switch, then it's too complicated.
bheadmaster 18 hours ago
Many Rust programmers despise Go's "if err != nil" pattern, but that pattern actually forces you to think about errors and "design" them to give meaningful messages, either by wrapping them (if the underlying error is expected to provide useful information), or by creating one from scratch.
It may be easier to just add the "?" operator everywhere (and we are lazy and will mostly do what is easier), but it often leads to the problem explained in the article.
- alembic_fumes 18 hours ago
  Hard disagree. Most of the Go code that I've ever worked with has been littered with one or another variant of the following:
```
  value, err := doFallibleOperation()
  if err != nil {
    return nil, fmt.Errorf("fallible operation failed - %w", err)
  }
```
  That error construct exclusively works for the poor human who has to debug the system, looking at its logs. No call stacks and, crucially, no automatic handling.
  At least with Rust's enums it is possible to make errors automatically actionable. If one skips that part and opts for anyhow because it's too much work, that's really a user problem.
  I like the author's idea of "designing" errors by exposing their actionability in the interface a lot. I'm not overall sold on whether that should be the primary categorization, but at least including a docstring to each enum variant about what can be done about the matter sounds like a nice way to improve most code a little bit.
  - Fizzadar 17 hours ago
    As a primarily Go dev - 100% agree. The endless check and wrap error results in long chains of messages you have to grep for to understand the call stack. For what benefit? Might as well just panic and recover/log the stack in many cases.
    - morshu9001 16 hours ago
      The error handling is by far my least favorite aspect of Go. It's tedious and dangerous. It should either be like Rust or like JS, there isn't a good third option.
      - tcfhgj 16 hours ago
        what about checked exceptions (Java)?
        - morshu9001 15 hours ago
          Isn't JS the same? But seems like people tend to make a lot of exception types in Java with inheritance, which I think is overkill.
          Typically I'll only have a couple of exception types that my own code throws, like user error vs system error. If I want more detail than that, it goes into the exception payload rather than defining many different types of exceptions.
    - formerly_proven 17 hours ago
      Artisanal callstacks
  - bheadmaster 16 hours ago
    > If one skips that part and opts for anyhow because it's too much work, that's really a user problem.
    If a language makes this more convenient than doing it right, one could argue that the language design is at fault.
  - Thaxll 15 hours ago
    In many codebases you have custom errors that implement the error interface (for HTTP codes and the like); it's very common.
- jayknight 18 hours ago
  >that pattern actually forces you to think about errors and "design" them to give meaningful messages
  Doesn't Rust's Result type(s) force you to do the same? Sure, you can pass them on with the ? operator, but it's still a choice you have to make.
- akdor1154 17 hours ago
  I think that was the intent of Go's design, but in practise I think it normally devolves into an overly verbose '?' with a poorly typed Result<_, String>.
  As a Go dev, I'm looking at this article with great interest. I would very much like to apply this approach to Go as well, I think the author has got a very strong design there.
- tison 12 hours ago
  FWIW, here is a general discussion about error handling in Rust and my comment to compare it with Go's/Java's flavor: https://github.com/apache/datasketches-rust/issues/27#issuec...
  That said, I can live with "if err != nil", but every type having a zero value is quite a headache to handle: you would fight with nil, typed nil, and zero value.
  For example, you need something like:
```
  type NullString struct {
   String string
   Valid  bool // Valid is true if String is not NULL
  }
```
  .. to handle a nullable value while `Valid = false && String = something` is by definition invalid but .. quite hard to explain. (Go has no sum type in this aspect)
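  For comparison, a sum type rules that state out by construction. A minimal Rust sketch (the `describe` function is only an illustration): `Option<String>` is either `Some` with a value or `None`, so there is no "invalid but has a string" combination to explain.

```rust
  /// Renders a nullable value; the compiler rejects a match that forgets `None`.
  fn describe(name: Option<String>) -> String {
      match name {
          Some(s) => format!("name = {s}"),
          None => "name is NULL".to_string(),
      }
  }
```

  An empty string stays a distinct, valid value: `Some(String::new())` is not the same as `None`.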
atrooo 9 hours ago
As good as the argument is, and the crate may be, I feel like I’ve been lied to when I realize I’m reading an AI generated blog post as is obvious by the end of this one.