
  • VAC-AGI 32 minutes ago
    This is a crucial clarification about the architecture.

    The MCA algorithm does not perform the majority of document retrieval; it is merely a mechanism that complements FAISS and BM25 for greater coverage. My system uses a truly hybrid retriever:

    FAISS (with BGE-large-en-v1.5, 1024D) handles the primary retrieval load, pulling in 60%+ of the documents.

    MCA acts as a specialized gate. The logic is: if two retrievers miss the ground truth (GT), the third one catches it. They complement each other—what FAISS misses, MCA finds.

    Pipeline Magic: Despite this aggressive union coverage (often exceeding 85% of documents), the reranker plays an equally critical role. The ground truth (GT) doesn't always reach the top 15, and the final LLM doesn't always grasp the context even when it is present. All the magic is in the deterministic pipeline orchestration.
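    To make the union-then-rerank flow concrete, here is a minimal sketch. Everything in it is a stand-in: the retriever outputs are toy lists and the scoring function is a placeholder, since FAISS, BM25, the MCA gate, and the actual reranker are not modeled here.

```python
# Hedged sketch of a hybrid retrieval pipeline: union the candidates
# from several retrievers, then rerank and keep the top-k.
# Retriever outputs and score_fn are illustrative stand-ins only.

def union_candidates(*ranked_lists):
    """Union candidate doc IDs from several retrievers, preserving
    first-seen order so earlier (higher-ranked) hits come first."""
    seen, merged = set(), []
    for ranked in ranked_lists:
        for doc_id in ranked:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

def rerank(candidates, score_fn, top_k=15):
    """Score every pooled candidate and keep the top_k. In a real
    pipeline score_fn would be a cross-encoder-style reranker."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]

faiss_hits = ["d3", "d7", "d1"]   # dense retriever (primary load)
bm25_hits  = ["d7", "d9"]         # lexical retriever
mca_hits   = ["d4"]               # third retriever catching the rest

pool = union_candidates(faiss_hits, bm25_hits, mca_hits)
print(pool)  # ['d3', 'd7', 'd1', 'd9', 'd4'] -- union coverage

# Placeholder scorer: lower doc number = higher score, for the demo.
top = rerank(pool, score_fn=lambda d: -int(d[1:]), top_k=3)
print(top)   # ['d1', 'd3', 'd4'] -- reranked top-k fed to the LLM
```

    The point of the sketch is the shape of the logic: any one retriever can miss a document, but the union only fails when all three miss it, and the reranker then decides what actually reaches the generation step.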

    LLM Agnosticism: The LLM (gpt-4o-mini) is only involved in the final answer generation, which makes my system highly robust and LLM-agnostic. You can switch to a weaker or stronger generative model; the overall accuracy will not change by more than ±10 points.

  • VAC-AGI 43 minutes ago
    Hello everyone, thank you for the intense feedback over the last hour.

    I see two main concerns emerging, and I want to be completely transparent:

    1. .so Files and IP Protection: The core MCA algorithm is compiled to a .so binary to protect the IP while I seek $100k pre-seed funding. This is the "lottery ticket" I need to cash in to scale. I did not retrain any model. The system's SOTA performance comes entirely from the proprietary MCA-first Gate logic. Reproducibility is guaranteed: you can run the exact binary that produced the 80.1% SOTA results and verify all logs.

    2. Overfitting vs. Architectural Logic: All LLM and embedding components are off-the-shelf; the success is due purely to the VAC architecture. MCA is a general solution designed to combat semantic drift in multi-hop, conversational memory. If I were overfitting by tuning boundaries, I would be seeing 95%+ accuracy, not 80.1%. The 16% failure rate reflects real limitations.

    Call to Action: Next Benchmarks. I need your recommendations: what are the toughest long-term conversation benchmarks you know? What else should I test the VAC Memory System on to truly prove its generalizability?

    GitHub: https://github.com/vac-architector/VAC-Memory-System

    I appreciate the honesty of the HN community and your help in validating my work.