Unconventional PostgreSQL Optimizations

(hakibenita.com)

222 points | by haki 8 hours ago

10 comments

zamalek 14 minutes ago
> The index is 214 MB! That's almost half the size of the entire table. So the analysts are happy, but you? Not so much...
This is part of a broader choice: write amplification. You'd want to, of course, have the most precise index possible - but no matter how you cut it, you are incurring extra I/O for writes - one for the tuple, one per index. How you index things is heavily influenced by the mix of reads and writes, and this is why we have data warehouses/read replicas in the first place: it allows us to avoid write amplification in the write path, while having fast filtered reads (that are slightly delayed).
If you're dealing with <ridiculous number of users>, there is a good chance that you don't want to be putting BI/OLAP indices on your OLTP database. You probably don't have enough users to worry about this - but - if you ever find that your writes are becoming an issue this is something to consider.
danielheath 50 minutes ago
The hash technique for uniqueness isn’t supported for indexes because it doesn’t handle hash collisions. The authors proposed solution suffers the same problem- values which do not already exist in the table will sometimes be rejected because they have the same hash as something that was already saved.
msdrigg 3 hours ago
The most interesting thing for me in this article was the mention of `MERGE` almost in passing at the end.
> I'm not a big fan of using the constraint names in SQL, so to overcome both limitations I'd use MERGE instead:
``` db=# MERGE INTO urls t USING (VALUES (1000004, 'https://hakibenita.com')) AS s(id, url) ON t.url = s.url WHEN MATCHED THEN UPDATE SET id = s.id WHEN NOT MATCHED THEN INSERT (id, url) VALUES (s.id, s.url); MERGE 1 ```
I use `insert ... on conflict do update ...` all the time to handle upserts, but it seems like merge may be more powerful and able to work in more scenarios. I hadn't heard of it before.
[-]
- gshulegaard 3 hours ago
  IIRC `MERGE` has been part of SQL for a while, but Postgres opted against adding it for many years because it's syntax is inherently non-atomic within Postgres's MVCC model.
  https://pganalyze.com/blog/5mins-postgres-15-merge-vs-insert...
  This is somewhat a personal preference, but I would just use `INSERT ... ON CONFLICT` and design my data model around it as much as I can. If I absolutely need the more general features of `MERGE` and _can't_ design an alternative using `INSERT ... ON CONLFICT` then I would take a bit of extra time to ensure I handle `MERGE` edge cases (failures) gracefully.
  [-]
  - kbolino 43 minutes ago
    It's kinda hard to handle MERGE failures gracefully. You generally expect the whole thing to succeed, and the syntax deceptively makes it seem like you can handle all the cases. But because of MVCC, you get these TOCTOU-style spurious constraint violations, yet there's no way to address them on a per-row basis, leading to the entire statement rolling back even for the rows that had no issues. If you are designing for concurrent OLTP workloads against the table, you should probably just avoid MERGE altogether. It's more useful for one-off manual fixups.
  - awesome_dude 42 minutes ago
    That reference - my initial gut feeling was that `MERGE` felt more readable, but then I read this paragraph
    > If you want the generality of MERGE, you have to accept the fact that you might get unique constraint violations, when there are concurrent inserts, versus with INSERT ON CONFLICT, the way it's designed with its speculative insertions, guarantees that you either get an INSERT or an UPDATE and that is true even if there are concurrent inserts. You might want to choose INSERT ON CONFLICT if you need the guarantee.
    Basically, `MERGE` is susceptible to a concurrent process also writing `INSERT` where that `INSERT` and `MERGE` are unaware of one another, causing a duplicate value to be used.
- philjohn 3 hours ago
  If you're doing large batch inserts, I've found using the COPY INTO the fastest way, especially if you use the binary data format so there's no overhead on the postgres server side.
  [-]
  - sirfz 2 hours ago
    That doesn't work well with conflicts tho iirc
sc68cal 5 hours ago
Great article, shows a lot of interesting PostgreSQL features. I have used PostgreSQL and MySQL for decades, and this article showed me that I have barely scratched the surface of what is possible.
[-]
- booi 1 hour ago
  I've used Postgres for more than a decade and everytime I wade into the docs I feel the same way, I'm barely scratching the surface. It's so immensely powerful.
- whalesalad 1 hour ago
  PostgreSQL is like Emacs. It's an operating system disguised as something else.
pksunkara 5 hours ago
I think a stored generated column allows you to create an index on it directly. Isn't it better approach?
[-]
- tczMUFlmoNk 5 hours ago
  The article explains why they want to avoid this option:
  > Starting at version 14, PostgreSQL supports generated columns - these are columns that are automatically populated with an expression when we insert the row. Sounds exactly like what we need but there is a caveat - the result of the expression is materialized - this means additional storage, which is what we were trying to save in the first place!
  [-]
  - pksunkara 5 hours ago
    Thanks, missed that part. I would still be interested in knowing how much additional storage that adds, if the OP is interested in updating the article.
- zenmac 5 hours ago
  >I think a stored generated column allows you to create an index on it directly. Isn't it better approach?
  Is it also possible to create index (maybe partial index) on expressions?
  [-]
  - masklinn 2 hours ago
    That's the first solution (a function based index), however it has the drawback of fragility: a seemingly innocent change to the query can lead to not matching the index's expression anymore). Which is why the article moves on to generated columns.
- rpsw 5 hours ago
  I assume it would increase the storage usage, which they say they are trying to avoid in that example.
themafia 3 hours ago
I moved into the cloud a few years ago and so I don't get to play with fixed server infrastructure like pgsql as much anymore.
Is the syntax highlighting built into pgsql now or is that some other wrapper that provides that? (it looks really nice).
[-]
- tuetuopay 2 hours ago
  I generally use pgcli to that end. Works well, has a few niceties like clearer transaction state, better reconnect, syntax highlighting, and better autocomplete that works in many more cases than plain psql (it can even autocomplete on clauses when foreign key relations are defined!).
  My only gripe with it is its insistence on adding a space after a line break when the query is too long, making copy/paste a pain for long queries.
- folli 3 hours ago
  You can use an IDE like IntelliJ and you get syntax highlighting, code completion etc.
OptionOfT 6 hours ago
Remarkably fresh content.
It's interesting how both virtual columns and hash indexes work, but feel like they're bolted on, vs being made part of the whole ecosystem so that they work seamlessly.
pphysch 2 hours ago
Is the Hash Index method strictly superior to creating a unique "hash" column and precomputing the hash in the application or query?
[-]
- sirfz 9 minutes ago
  It'll use less storage space
lasgawe 4 hours ago
some points from this article that I didn't know before.
SigmundA 4 hours ago
>Currently, constraint exclusion is enabled by default only for cases that are often used to implement table partitioning via inheritance trees. Turning it on for all tables imposes extra planning overhead that is quite noticeable on simple queries, and most often will yield no benefit for simple queries.
PG's lack of plan caching strikes again, this sort of thing is not a concern in other DB's that reuse query plans.
[-]
- singron 3 hours ago
  PG does reuse plans, but only if you prepare a query and run it more than 5 times on that connection. See plan_cache_mode[0] and the PREPARE docs it links to. This works great on simple queries that run all the time.
  It sometimes really stinks on some queries since the generic plan can't "see" the parameter values anymore. E.g. if you have an index on (customer_id, item_id) and run a query where `customer_id = $1 AND item_id = ANY($2)` ($2 is an array parameter), the generic query plan doesn't know how many elements are in the array and can decide to do an elaborate plan like a bitmap index scan instead of a nested loop join. I've seen the generic plan flip-flop in a situation like this and have a >100x load difference.
  The plan cache is also per-connection, so you still have to plan a query multiple times. This is another reason why consolidating connections in PG is important.
  0: https://www.postgresql.org/docs/current/runtime-config-query...
  [-]
  - SigmundA 1 hour ago
    Yes manual query preparation by client [1] is what you did in MSSQL server up until v7.0 I believe, which was 1998 when it started doing automatic caching based on statement text. I believe it also cached stored procedures before v7.0 which is one reason they were recommended for all application code access to the database back then.
    MSSQL server also does parameter sniffing now days and can have multiple plans based on the parameters values it also has a hint to guide or disable sniffing because many times a generic plan is actually better, again something else PG doesn't have, HINTS [2].
    PG being process based per connection instead of thread based makes it much more difficult to share plans between connections and it also has no plan serialization ability. Where MSSQL can save plans to xml and they can be loaded on other servers and "frozen" to use that plan if desired, they can also be loaded into plan inspection tools that way as well [3].
    1. https://learn.microsoft.com/en-us/sql/relational-databases/n...
    2. https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-tr...
    3. https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-tr...