

It warms the cockles of my heart that I renamed my self-hosted LLM’s deep thinking mode to Mentats. For shits and giggles, I made it append every “deep thinking” conclusion it makes with [ZARDOZ HAS SPOKEN!].
It’s the simple things, really.




Sorry - I think I misunderstood part of your question (what stage have you actually gotten to). See what I mean about needing sentiment analysis LOL
Did you mean about the MoA?
The TL;DR - I have it working - right now - on my rig. It’s strictly manual. I need to detangle it and generalise it, strip out personal stuff and then ship it as v1 (and avoid the oh so tempting scope creep). It needs to be as simple as possible for someone else to retool.
So, it’s built and functional right now…but the detangling, writing up specs and docs, uploading everything to Codeberg and mirroring etc will take time. I’m back to work this week and my fun time will be curtailed…though I want nothing more than to hyperfocus on this LOL.
One of the issues with ASD is most of us over-engineer everything for the worst case adversarial outcomes, as a method of reducing meltdowns/shutdowns. Right now, I am specifically using my LLM like someone who hates it and wants to break it…to make sure it does what I say it does.
If you’d like, I can drop my RFC (request for comments, in engineering talk) for you to look at / verify with another LLM / ask someone about. This thing is real, not hype and not vibe coding. I built this because my ASD brain needs it and because I was driven by spite / too miserly to pay out the ass for a decent rig. Ironically, those constraints probably led to something interesting (I hope) that can help others (I hope). Like everything else, it’s not perfect, but it does what it says on the tin 9 times out of 10…which is about all you can hope for.
Right?
Everyone knows you’re meant to use a banana as a telephone.
https://www.youtube.com/watch?v=3l9nLXczT3s
Or, alternatively given where we are
https://yewtu.be/search?q=connor+for+real+weirdo
PS: yes, I was tempted to use Raffi’s song here instead


Well, technically, you don’t need any GPU for the system I’ve set up, because only 2-3 models are “hot” in memory (so about…10GB?) and the rest are cold / invoked as needed. My own GPU is only 8GB (and my prior one was 4GB!). I designed this with low-end rigs in mind.
The minimum requirement is probably a CPU equal to or better than mine (i7-8700; not hard to match), 8-10GB RAM and maybe 20GB disk space. Bottom of the barrel would be 4GB, but you’ll have to deal with SSD thrashing.
Anything above that is a bonus / tps multiplier.
FYI: CPU only (my CPU, at least) + 32GB system RAM, this entire thing runs at about 10-11 tps, which is interactive enough / faster than reading speed. Any decent GPU should get you 3-10x that. I designed this for peasant-level hardware / to punch GPTs in the dick through clever engineering, not sheer grunt. Fuck OpenAI. Fuck Nvidia. Fuck DDR6. Spite + ASD > “you can’t do that” :). Yes I fucking can - watch me.
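To make the hot/cold part concrete, here’s a minimal sketch of the LRU juggling involved. llama-swap does the real work in my setup; the `loader` callable and `unload()` method here are hypothetical stand-ins:

```python
from collections import OrderedDict

class ModelPool:
    """Keep at most `max_hot` models resident; evict the least recently used."""

    def __init__(self, loader, max_hot=2):
        self.loader = loader      # hypothetical: name -> loaded model handle
        self.max_hot = max_hot
        self.hot = OrderedDict()  # name -> handle, coldest first

    def get(self, name):
        if name in self.hot:
            self.hot.move_to_end(name)                   # refresh LRU position
        else:
            if len(self.hot) >= self.max_hot:
                _, stale = self.hot.popitem(last=False)  # coldest model
                stale.unload()                           # hypothetical: free RAM/VRAM
            self.hot[name] = self.loader(name)           # cold start
        return self.hot[name]
```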
If you want my design philosophy, here is one of my (now shadowbanned) posts from r/lowendgaming. Seeing as you’re a gamer, this might make sense to you! The MoA design I have is pure “level 8 spite, zip-tie a Noctua fan to a server-grade GPU and stick it in a 1L shoebox” YOLOing :).
It works, but it’s ugly, in a beautiful way.
[Lowend gaming iceberg: Levels 1-9]


Agreed. I have concerns with how Microsoft is handling GitHub, but organic discovery sure seems to favour GitHub / reddit / YouTube.
Unsurprisingly, YouTube (Google) really doesn’t trust accounts without phone numbers attached. I set mine up before that was a requirement, using a @skiff address, so my ability to upload long-form videos is curtailed; I think it was shadowbanned from day 1, irrespective of how much we watch YT.
Probably the smart thing to do is to set up on Codeberg, maybe upload some “how to” videos to the Internet Archive, and have GitHub mirroring / forwarding.
That way whoever wants to find it can find it, somehow.


I’ll try explaining using an analogy (though I can go nerd mode if that’s better? Let me know; I’m assuming an intelligent lay audience for this but if you want nerd-core, my body is ready lol).
PS: Sorry if scattered - am dictating using my phone (on holiday / laptop broke).
Hallucinations get minimised the same way a teacher might stop a student from confidently bullshitting on their book reports: you control context (what they’re allowed to talk about) and when they’re allowed to improvise, and you make them show their work when it matters, like in a class presentation.
Broadly speaking, that involves using RAG and GAG (over your own documents) as “ground truth”, setting the temperature low (so the LLM has no flights of fancy) and adding verifier passes / critic assessment by a second model.
Additionally, a lot of hallucinations come from the model half-remembering something that isn’t in front of it and then “improvising”.
To minimise that, I coded a little Python tool that forces the LLM to store facts verbatim (triggered by typing !!) into a JSON (text) file, so that when you ask it something, it recalls them exactly, as a sort of rolling memory. The basis of that is something I made earlier for OWUI:
https://openwebui.com/posts/total_recall_4a918b04
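Stripped to its bones, that kind of verbatim fact store is just this. A sketch only; the file name and the !! handling are illustrative, and the real thing is the Total Recall function linked above:

```python
import json, os

FACTS_PATH = "facts.json"  # illustrative path

def _load() -> list[str]:
    if os.path.exists(FACTS_PATH):
        with open(FACTS_PATH) as f:
            return json.load(f)
    return []

def store_fact(text: str) -> None:
    """Append the user's words verbatim - no LLM anywhere in the write path."""
    facts = _load()
    facts.append(text)
    with open(FACTS_PATH, "w") as f:
        json.dump(facts, f, indent=2)

def recall_facts() -> str:
    """Returned as-is for prompt injection, so recall is exact, never paraphrased."""
    return "\n".join(_load())

# e.g. a message starting with !! gets stored, minus the trigger
msg = "!!My GPU has 8GB VRAM"
if msg.startswith("!!"):
    store_fact(msg[2:].strip())
```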
So what I have in place is this -
I use / orchestrate a couple of different models, each one tuned for a specific behaviour. They work together to produce an answer.
My Python router then invokes the correct model for the task at hand based on simple rules (is the question over 300 words? Does it have images? Does it involve facts and figures, or is it brainstorming/venting/shooting the shit?).
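The routing really is that dumb, on purpose: plain if/else you can read and audit. A toy version (the model names and keyword list are made up for illustration):

```python
FACTY = ("how many", "when did", "spec", "cite", "source", "according to")

def route(question: str, has_images: bool = False) -> str:
    """Pick a model by simple, auditable rules - no ML in the router itself."""
    if has_images:
        return "vision_model"
    if len(question.split()) > 300:
        return "long_context_model"
    if any(k in question.lower() for k in FACTY):
        return "grounded_model"   # answers only from RAG + the facts file
    return "chat_model"           # brainstorming / venting / shooting the shit

print(route("When did the P330 Tiny launch?"))  # -> grounded_model
```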
The models I use are
To give a workflow example - you ask a question.
The Python router decides where it needs to go. Let’s suppose it’s a technical lookup / thinking about something in my documents.
The “main brain” generates an answer using whatever grounded stuff you’ve given it access to (in the Qdrant database and JSON text file). If there’s no stored info, it notes that explicitly and proceeds to the next step (I always want to know where it’s pulling its info from, so I make it cite its references).
That draft gets handed to a separate “critic” whose entire job is to poke holes in it. (I use very specific system prompts for both models so they stay on track.)
Then the main brain comes back for a final pass where it fixes the mistakes, reconciles the critique, and gives you the cleaned‑up answer.
It’s also allowed to say “I’m not sure; I need XYZ for extra context. Please provide”.
It’s basically: propose → attack → improve.
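In pseudo-Python, that loop is roughly the following. `llm` and `critic` stand in for calls to the two models, and the prompts are paraphrases, not my actual system prompts:

```python
def answer(question: str, context: str, llm, critic) -> str:
    """Propose -> attack -> improve, with every intermediate step loggable."""
    draft = llm(
        f"Context:\n{context}\n\nQ: {question}\n"
        "Answer ONLY from the context and cite your sources. If the context "
        "is insufficient, say so and name what extra info you need."
    )
    critique = critic(
        "Poke holes in this draft: factual errors, missing citations, "
        f"unsupported claims.\n\nDraft:\n{draft}"
    )
    final = llm(
        f"Question: {question}\nYour draft:\n{draft}\nCritique:\n{critique}\n"
        "Reconcile the critique, fix the mistakes, and give the final answer."
    )
    return final
```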
Additionally, I use a deterministic memory system (basically just a Python script that writes facts to a JSON / text file exactly as given and retrieves them exactly as stored), without editorialising the facts of a conversation in progress.
Facts stored get recalled exactly, without LLM massage or rewrite.
Urgh, I hope that came out OK. I’ve never had to verbally rubber-duck (explain) it to my phone before :)
TL;DR
Hallucinations minimised by -
Careful fact scraping and curation (using Qdrant database, markdown text summaries and rolling JSON plain text facts file)
Python router that decides which LLM (or more accurately, SLM, given I only have 8GB VRAM) answers what, based on simple rules (e.g. coding questions go to the coder, science questions go to the science model, etc.)
Keeping important facts outside of the LLM, so it has to reference them directly (RAG, GAG, JSON rolling summary).
Setting model temperatures so that responses are as deterministic as possible (no flowery language or fancy reinterpretations; just the facts, ma’am).
Letting the model say “I don’t know, based on context. Here’s my best guess. Give me XYZ if you want a better answer”.
Basic flow:
ask question --> router calls model/s --> “main brain” polls stored info, thinks and writes draft --> get criticized by separate “critic” --> “main brain” gets critic output, responds to that, and produces final version.
That reduces “sounds right” answers that are actually wrong. All the seams are exposed for inspection.


NPUs yes, TPUs no (or not yet). Rumour has it that Hailo is meant to be releasing a plug-in NPU “soon” that accelerates LLM inference.


I’m still sanguine that 1.58-bit BitNet models will take off. Those could plausibly run at a good clip on existing CPUs, no GPU needed.
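The “1.58” is log2(3) ≈ 1.58 bits per weight, because every weight is just -1, 0 or +1. Here’s a toy sketch of the absmean quantization described in the BitNet b1.58 paper (my paraphrase of the paper’s formula, not Microsoft’s actual code):

```python
import numpy as np

def ternarize(W: np.ndarray):
    """Absmean quantization: scale by the mean |weight|, round to {-1, 0, +1}."""
    gamma = np.abs(W).mean() + 1e-8            # per-tensor scale
    W_q = np.clip(np.round(W / gamma), -1, 1)  # every entry -> -1, 0 or +1
    return W_q.astype(np.int8), gamma          # dequantize later as W_q * gamma

W = np.random.randn(4, 4).astype(np.float32)
W_q, gamma = ternarize(W)
print(W_q)  # only -1, 0, +1 -> log2(3) ≈ 1.585 bits of information per weight
```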
Super basic Medium article for those not in the know
Necessity (plus spite) is usually a good driver…though given BitNet is Microsoft IP…ehh…I won’t hold my breath for too long. Still waiting for their 70B model to drop…maybe this year…


I get where you’re coming from (and why) but am of the “rip the bandaid off clean in one go” school of thought.
A smaller start (like using that RPi to self-host a Jellyfin server for your home) puts you on the road to sovereignty straight away. A Pi 4 costs what…$60 (plus $30 for a power supply and SD card)? Hell, use an old laptop.
Once you have one thing running, you’re on the right path for the next and the next.
Doing it on the cloud I think is paradoxically harder and ultimately self defeating.
Don’t get me wrong - if you need the cloud (say, you need to rent an H100 for a few hours to fine-tune your LLM), I’m all for it. But if sovereignty is the goal - and the gateway drug is an SBC and a few days / weeks of self-learning…you may as well start eating the elephant. IMHO and YMMV of course.


The cloud? You mean someone else’s computer? 🤣


Yes.
There’s a lot there. Feel free to skip to the security cameras section (linked above).


Yep! I will mirror it here -
(It’s empty rn / a placeholder only.)
I had a bunch of prelim write-ups on r/LocalLLM, r/LocalLlama and r/homelab, but they’re in the shadow realm now due to the reddit ban (fuck reddit).
I will also post it on @homelabs and @privacy here; I think my MoA design is worthwhile enough to maybe even merit a post on Hacker News…but I want to dot all the i’s and cross all the t’s before I get into that bar fight lol.


I’m doing exactly this atm: running a homelab on a $200 USD Lenovo P330 Tiny with a Tesla P4 GPU, via Proxmox, CasaOS and various containers. I’m about 80% finished with what I want it to do.
Uses 40W at the wall (peaks around 100W). IOW, about the cost of a light bulb. Here’s what I run -
LXC 1: Media stack
Radarr, Sonarr, SABnzbd, Jellyfin. Bye bye Netflix, D+ etc.
LXC 2: Gaming stack
Emulation and the PC games I like. Lots of fun indie titles, older games (GameCube, Wii, PS2). Stream from the homelab to any TV in the house via Sunshine / Moonlight. Bye bye GeForce Now.
LXC 3: AI stack
llama.cpp + llama-swap (AI back ends)
Qdrant (vector database for documents)
Open WebUI (front end)
Bespoke MoA system I designed (which I affectionately call my Mixture of Assholes, not Agents), using a Python router and some clever tricks to make a self-hosted AI that doesn’t scrape my shit and is fully auditable and non-hallucinatory…which would otherwise be impossible with typical cloud “black box” approaches. I don’t want a black box; I want a glass box.
Bye bye ChatGPT.
LXC 4: Telecom stack
Vocechat (self-hosted family chat, replacement for WhatsApp / Messenger),
Lemmy node (TBC).
Bye bye WhatsApp and Reddit
LXC 5: Security stack
WireGuard (own VPN). NPM (reverse proxy). Fail2Ban. Pi-hole (block ads).
LXC 6: Document stack
Immich (Google Photos replacement), Joplin (Google Keep), Snapdrop (AirDrop), Filedrop (Dropbox).
Once I have everything tuned perfectly, I’m going to share it all on GitHub / Codeberg. I think the LLM stack alone is interesting enough to merit attention. Everyone makes big claims, but I’ve got the data and method to prove it. I welcome others poking at it.
Ultimately, people need to know how to do this, and I’m doing my best to document what I did so that someone could replicate and improve it. Make it easier for the next person. That’s the only way forward - together. Faster alone, further together and all that.
PS: It’s funny how far spite will take someone. I got into media servers after YouTube premium, Netflix etc jacked their prices up and baked in ads.
I got into lowendgaming when some PCMR midwit said “you can’t play that on your p.o.s. rig”. Wrong - I can and I did. It just needed know-how, not “throw money at the problem till it goes away”.
I got into self-hosting LLMs when ChatGPT kept being…ChatGPT. Wasting my time and money with its confident, smooth lies. No, unacceptable.
The final straw was when Reddit locked my account and shadow banned me for using different IP addresses while travelling / staying at different AirBNBs during holiday “for my safety”.
I had all the pieces there…but that was the final “fine…I’ll do it myself” Thanos moment.
Do we dare ask why you need 48TB to store media, or do we slowly back out of the room, avoiding eye contact?