- Are you using pg_repack? I'm fairly sure its logic has some holes - last time I checked its bug tracker listed potential for data corruption that could cause issues like this.
- Have you done OS upgrades? Did the affected indexes include any collation-sensitive columns?
- Have you done analysis on the heap page? E.g. is there any valid data on the page? What is the page's LSN compared to the LSN on index pages pointing to non-existing tuples on the page?
- We've used pg_repack in the past, though I'm 90% sure we didn't use it in the timeframe in which this corruption must have happened. Anyway, we'll look into this further: thanks for the suggestion.
- Yes, we've done OS upgrades. Back in 2021, our DB servers were on Debian Buster; they are now on Bookworm. We're aware of the problems caused by collation changes, and indeed that was one of the first things we checked; but we're careful to use the C locale for our database, so we believe we're safe on that front.
- For the example we gave (index page 192904826, referencing heap page 264925234), the index page LSN is DB4A3/C73ED0C0, and the heap page LSN is DB4FA/4CAAB9D8, so the index page was written shortly before the heap page. The blog post shows the output of SELECT * FROM heap_page_items for the heap page: it looks like a regular empty page to me.
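For anyone wanting to reproduce this kind of analysis, here's a rough sketch of the queries involved, using the pageinspect extension. The table and index names below are placeholders, but the block numbers are the ones from the example above:

```sql
-- Hypothetical relation names; substitute the real table and index.
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- The heap page's LSN (and its stored checksum, if data checksums are enabled)
-- comes from the page header.
SELECT lsn, checksum, flags
FROM page_header(get_raw_page('some_table', 264925234));

-- The line pointers / tuple headers on the heap page; an empty page returns no rows.
SELECT * FROM heap_page_items(get_raw_page('some_table', 264925234));

-- The entries on the B-tree index page, including the heap (block, offset)
-- each one points at (the ctid column).
SELECT itemoffset, ctid, itemlen
FROM bt_page_items('some_table_idx', 192904826);
```

The index page's own LSN can be read the same way, via page_header(get_raw_page('some_table_idx', 192904826)).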
The post appears to conclude that this must be a hardware issue because they have no explanation and PostgreSQL and the kernel are too reliable to have data corruption bugs. I've seen data corruption bugs in both databases and the kernel (as well as CPUs, for that matter), so I'm pretty skeptical of that explanation.
When something "can't happen" in your program, it makes sense to look at the layers below. Unfortunately, this often goes one of two ways: you ask people for help and they tell you that it's never one of the layers below ("it's never a compiler bug") or you stop at the conclusion "well, I guess the layer below [kernel/TCP/database/etc.] gave us corrupted data". The conclusion in this post kind of does both of these things. Of course, sometimes it _is_ a bug in one of those layers. But stopping there is no good either, especially when the application itself is non-trivial and you have no evidence that a lower layer is at fault.
People often treat a hypothesis like "the disk corrupted the data" as unfalsifiable. After the fact, that might be true, given the stack you're using. But that doesn't have to be the case. If you ran into a problem like this on ZFS, for example, you'd have very high confidence about whether the disk was at fault (because it can reliably detect when the disk returns data different from what ZFS wrote to it). I realize a lot goes into choosing a storage stack and maybe ZFS doesn't make sense for them. But if the hypothesis is that such a severe issue resulted from a hardware/firmware failure, I'd look pretty hard at deploying a stack that can reliably identify such failures. At the very least, if you see this again, you'll either know for sure it was the disk or you'll have high confidence that there's a software bug lurking elsewhere. Then you can add similar kinds of verification at different layers of the stack to narrow down the problem. In an ideal world, all the software should be able to help exonerate itself.
We certainly haven't ruled out Postgres or kernel bugs here.
> If you ran into a problem like this on ZFS, for example, you'd have very high confidence about whether the disk was at fault
Would we, though? I'll admit to not being that familiar with ZFS's internals, but I'd be a bit surprised if its checksums can detect lost writes. More generally, I'm not entirely sure how practical it would be to add verification at all layers of the stack, as you seem to be suggesting.
We'd certainly be open to considering ZFS in future if it can help track down this sort of problem.
Sorry I didn’t say it sooner: thanks for sharing this post! And for your work on Matrix. (Sorry my initial post focused on the negative. This kind of thing brings up a lot of scar tissue for me but that’s not on you.)
>> If you ran into a problem like this on ZFS, for example, you'd have very high confidence about whether the disk was at fault
> Would we, though? I'll admit to not being that familiar with ZFS's internals, but I'd be a bit surprised if its checksums can detect lost writes.
Yup. Quoting https://en.wikipedia.org/wiki/ZFS#Data_integrity:
> One major feature that distinguishes ZFS from other file systems is that it is designed with a focus on data integrity by protecting the user's data on disk against silent data corruption caused by data degradation, power surges (voltage spikes), bugs in disk firmware, phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc.
(end of quote)
It does this by maintaining the data checksums within the tree that makes up the filesystem's structure. Any time nodes in the tree refer to data that's on disk, they also include the expected checksum of that data. That's recursive up to the root of the tree. So any time it needs data from disk, it knows when that data is not correct.
---
> More generally, I'm not entirely sure how practical it would be to add verification at all layers of the stack, as you seem to be suggesting.
Yeah, it definitely could be a lot of work. (Dealing with data corruption in production once it's happened is also a lot of work!) It depends on the application and how much awareness of this problem was baked into its design. An RDBMS being built today could easily include a checksum mechanism like ZFS's, and it looks like CockroachDB does include something like this (just as an example). Adding this to PostgreSQL today could be a huge undertaking, for all I know. I've seen plenty of other applications (much simpler than PostgreSQL, though some still fairly complex) that store bits of data on disk and do include checksums to detect corruption.
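To make the application-level version of that concrete, here's a minimal sketch of the idea in SQL (the table and column names are made up, and it leans on PostgreSQL 11+'s built-in sha256()): store a digest next to each payload at write time, and recompute it on read or during a periodic scrub.

```sql
-- Hypothetical schema: each payload carries its own SHA-256 digest.
CREATE TABLE blobs (
    id      bigint PRIMARY KEY,
    payload bytea  NOT NULL,
    digest  bytea  NOT NULL   -- sha256(payload), computed when the row is written
);

-- Write path: compute the digest together with the data.
INSERT INTO blobs (id, payload, digest)
VALUES (1, 'hello'::bytea, sha256('hello'::bytea));

-- Scrub / read path: flag any row whose payload no longer matches its digest.
SELECT id
FROM blobs
WHERE digest IS DISTINCT FROM sha256(payload);
```

That's obviously a far cry from what ZFS does (which protects its own metadata recursively, too), but it's the same basic contract: the reader can tell whether it got back what the writer wrote.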
Easier said than done in this case. Crosschecks effective enough to prevent this issue from occurring would entail rather massive I/O and CPU amplification in common operations.
We could have run with data checksums (https://www.postgresql.org/docs/current/app-pgchecksums.html) turned on, but they slow things down a bunch - and turning them on in retrospect would have taken days. It's also not clear that they would have caught whatever the underlying corruption was here…
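For reference (assuming PostgreSQL 12 or later), checking whether data checksums are on, and watching for verification failures once they are, looks roughly like this; pg_checksums itself is the offline tool for enabling them on an existing cluster, which is why it would have taken days:

```sql
-- Is the cluster running with data checksums? Shows "on" only if they were
-- enabled at initdb time, or later with an offline pg_checksums --enable run.
SHOW data_checksums;

-- PostgreSQL 12+ counts checksum verification failures per database.
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE checksum_failures > 0;
```

(That caveat makes sense: data checksums only detect pages whose bytes changed underneath PostgreSQL; a stale but internally valid page, such as one left behind by a lost write, would still verify fine.)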