Ohi, I'm the original creator of Searx, but due to the limitations of the metasearch concept I'm not involved in the development anymore. My new search project is https://github.com/asciimoo/hister (https://hister.org/).
Hister is a full text indexer for websites and local files which automatically saves all the visited pages rendered by your browser. Storing full page content allows serving offline result previews and the full page content via MCP.
I'm sorry for not taking the time to read the docs, but I have a question.
Some 20 years ago a friend of mine has set up a local proxy (python if I'm not mistaken) that was gathering all his web traffic and served him as a long term memory. The proxy had a web interface and allowed him to quickly find something he saw ca. 10 days ago, or that specific algorithm he recalls but can't remember it's name.
For years I've been collecting links to different work related trivia which I use on a daily basis as a rabbit-from-a-hat solution to answer random question from friends and coworkers. For example someone randomly asked me for an idea for color palette for data charts and I can immediately give them a scientific research into the color palette. Or an obscure algorithm.
But with time the collection has grown substantially and it's really cumbersome to find the proper things.
Absolutely, this is a great example where Hister can shine.
I started Hister as a proxy as well, but quickly switched to the current extension based approach, because intercepting HTTPS traffic requires a MiTM proxy which is much more painful to setup than installing a browser extension.
would it be possible to gdrive/rsync/git the data between machines and then use the data on an online server for retrieval (given that I would handle data sync myself)?
also what exactly are you using for search? does it support trigrams? how do you sort results?
Also very interested in this. I was playing around with doing the same thing with YaCY. I want the proxy aspect so that I can proxy my phone traffic through it as well.
Wow! that looks like a bit of software I have been dreaming about for awhile - will definately check out! You're at least doing something right in communicating the reasons why and appeal for starters! All the best!
Always excited to see new things like Hister in the search space. What are the scaling limits, as far as you can tell in terms of how much can it hold before queries start breaking down or become too slow to be useful? Could it evolve into a general internet search engine if, say, enough trusted members of a geo-distributed YugabyteDB cluster and an army of crawlers built a sufficient index?
> What are the scaling limits, as far as you can tell in terms of how much can it hold before queries start breaking down or become too slow to be useful?
There has been no stress tests in this regard. The indexer lib Bleve [1] can handle millions of documents according to their documentation.
> Could it evolve into a general internet search engine if, say, enough trusted members of a geo-distributed YugabyteDB cluster and an army of crawlers built a sufficient index?
My long term goal is exactly this. I'd like to add federation/P2P feature [2][3] to evolve from being a private search companion. I'd appreciate any help designing the system.
I love your idea and wondered why saving and indexing browser visited pages was not being done. Does this handle large amounts of local files, for example 10-20TB across file types like Powerpoint, Excel, Word, and PDF?
In its current form it cannot handle this amount of data efficiently (and doesn't support powerpoint/excel/word yet), but this is a valid use-case, I've added a TODO item to experiment with it.
Both are search engines, but that's all the similarity. Hister has a traditional crawler, but its biggest strength is automatically indexing browser tabs as those are rendered. This way it bypasses authentication, CloudFlare, captchas and most of the annoying limitations of traditional crawlers.
Hister also provides full offline result previews. Check out the small read-only demo: https://demo.hister.org/
I installed this a while back and honestly I almost never touch it. It turns out that for me searching my history doesn't really replace a search engine at all. The built in extractor list is pretty limited and adding them seems like too much of an ordeal for me to bother.
Sure, it cannot fully replace web search engines (yet), but it can reduce the dependence on these services more and more as your index grows. Hister is designed to support quickly falling back to traditional search engines with a single hotkey if no results found.
I agree, we should add more extractors [1]. Can you recommend extractors you missed?
TinySearch wraps this and works well for agents. It's better than the native SearXNG MCP because it optimizes the context before it even gets to the agent so as to not waste tokens.
SearXNG is my daily internet search now +5 years; with YaCY Backends and else as fallback. I also build internal document search or RAG applications with this setup (SearXNG also support json results).
However, there are some downer I accept because of privacy:
1. Its slower and the results are not that good then with others. But fast and good enough for most of my queries.
2. From time to time you get blocked on the duckduckgo, brave or whatever search and you must solve some captures. You can prevent this by getting and using API-Keys from them.
The nice thing about using your own backend is, that you can prio it in the results and for example, if I crawl the smallweb and other site important for myself, this sites come up first in the results.
It has a JSON mode that you need to enable in settings and then you can create a simple python script to interact with it or have the agent use `curl` and `jq` to interact with it.
I am also interested in what a full local AI stack with web search and other tools looks like. As far as I can tell, SearX does not embed an MCP server, so it can't be directly called from llama-server for example. Open WebUI does have an integration for SearX and other providers, but the results I obtained weren't particularly impressive.
I've been using SearXNG for a few years now, however I've been trying out Degoog as a SearXNG alternative since I've had issues with engines constantly failing or being slow since day 1 of using SearXNG, but Degoog has worse results with the same engines. It's a shame since I'm having to pick between slower but better results, or very fast but worse results.
I’ll have to try, I’ve only recently learned Exa pricing is a bit crazy (especially on searches where you source 30-40 sources)I just used it be default and then was like oh damn when I got hit
I've been using this for some projects. It's exceptional and I recommend it highly.
I actually included a recipe to deploy it to kubernetes in typekro, my TypeScript infrastructure-as-code project for kubernetes: https://typekro.run/api/searxng/
Yeah, I find that searx results are way more relevant to what I’m actually looking for than a single engine. There’s so much manipulation going on that if you don’t aggregate multiple engines, it’s near impossible to get what you want.
Hister is a full text indexer for websites and local files which automatically saves all the visited pages rendered by your browser. Storing full page content allows serving offline result previews and the full page content via MCP.
Take a look at how the MCP can be utilized: https://hister.org/posts/give-your-ai-assistant-a-private-me...
Some 20 years ago a friend of mine has set up a local proxy (python if I'm not mistaken) that was gathering all his web traffic and served him as a long term memory. The proxy had a web interface and allowed him to quickly find something he saw ca. 10 days ago, or that specific algorithm he recalls but can't remember it's name.
For years I've been collecting links to different work related trivia which I use on a daily basis as a rabbit-from-a-hat solution to answer random question from friends and coworkers. For example someone randomly asked me for an idea for color palette for data charts and I can immediately give them a scientific research into the color palette. Or an obscure algorithm.
But with time the collection has grown substantially and it's really cumbersome to find the proper things.
Would your project be a good fit for my problem?
I started Hister as a proxy as well, but quickly switched to the current extension based approach, because intercepting HTTPS traffic requires a MiTM proxy which is much more painful to setup than installing a browser extension.
also what exactly are you using for search? does it support trigrams? how do you sort results?
There has been no stress tests in this regard. The indexer lib Bleve [1] can handle millions of documents according to their documentation.
> Could it evolve into a general internet search engine if, say, enough trusted members of a geo-distributed YugabyteDB cluster and an army of crawlers built a sufficient index?
My long term goal is exactly this. I'd like to add federation/P2P feature [2][3] to evolve from being a private search companion. I'd appreciate any help designing the system.
[1] https://blevesearch.com/docs/Home/ [2] https://github.com/asciimoo/hister/discussions/432 [3] https://hister.org/posts/public-search
I agree, we should add more extractors [1]. Can you recommend extractors you missed?
[1] https://github.com/asciimoo/hister/issues/305
https://github.com/MarcellM01/TinySearch
The nice thing about using your own backend is, that you can prio it in the results and for example, if I crawl the smallweb and other site important for myself, this sites come up first in the results.
I'm curious what setups folks use to provide this functionality.
Since the quantized 24B parameter Gemma model came out, I've had good luck with tool calling on a 4070 Ti Super.
Successful tool calling is what finally made the local experience useful.
I should note this is for the general and not coding specific context.
It's at the bottom of this page: https://docs.searxng.org/admin/settings/settings_search.html
I actually included a recipe to deploy it to kubernetes in typekro, my TypeScript infrastructure-as-code project for kubernetes: https://typekro.run/api/searxng/