Unix "find" expressions compiled to bytecode

(nullprogram.com)

86 points | by rcarmo 8 hours ago

2 comments

drob518 2 hours ago
From the article:
> I was later surprised all the real world find implementations I examined use tree-walk interpreters instead.
I’m not sure why this would be surprising. The find utility is totally dominated by disk IOPS. The interpretation performance of find conditions is totally swamped by reading stuff from disk. So, keep it simple and just use a tree-walk interpreter.
[-]
- Someone 48 minutes ago
  Is it truly simpler to do that? A separate “command line to byte codes” module would be way easier to test than one that also does the work, including making any necessary syscalls.
  Also, decreasing CPU usage many not speed up find (much), but it would leave more time for running other processes.
  [-]
  - drob518 8 minutes ago
    If it was easier to interpret byte codes, nobody would use a tree-walk interpreter. There’s no performance reason to use a tree-walk interpreter. They all do it because it’s easy. You basically already have the expression in tree form, regardless of where you end up. So, stop processing the tree and just interpret it.
- chubot 1 hour ago
  Yeah that's basically what was discussed here: https://lobste.rs/s/xz6fwz/unix_find_expressions_compiled_by...
  And then I pointed to this article on databases: https://notes.eatonphil.com/2023-09-21-how-do-databases-exec...
  Even MySQL, Duck DB, and Cockroach DB apparently use tree-walking to evaluate expressions, not bytecode!
  Probably for the same reason - many parts are dominated by I/O, so the work on optimization goes elsewhere
  And MySQL is a super-mature codebase
tasty_freeze 5 hours ago
That is a fun exercise, but I imagine the time to evaluate the conditional expression is a tiny fraction, just a percent or less, than the time it takes to make the file system calls.
[-]
- nasretdinov 4 hours ago
  For many cases you don't even need to make stat() call to determine whether or not the file is a directory (d_type specifically can tell it: https://man7.org/linux/man-pages/man3/readdir.3.html). That's what allows find(1) to be so quick
  [-]
  - loeg 52 minutes ago
    You could imagine determining from the parsed expression whether or not stat'ing was required.
    NFS has readdirplus, but I don't think it ever made its way into Linux/POSIX. (Some filesystems could efficiently return dirents + stat information.)
    [-]
    - nasretdinov 16 minutes ago
      > readdirplus
      Well, it definitely does _something_, because on NFS the subsequent stat() calls after reading the directory names do indeed complete instantly :), at least in my testing.
      [-]
      - loeg 5 minutes ago
        I mean, readdirplus as a local filesystem API. Ultimately unix programs are just invoking getdents() (or equivalent) + stat() (or statx, whatever). Linux nfsclient probably caches the result of readdirplus for subsequent stat.
- CerryuDu 4 hours ago
  ... not to mention the time it takes to load directory entries and inodes when the cache is cold.