660 AI Agents Ran 27,000 Experiments. Their Biggest Discovery Was a 2015 Textbook Result.

Author(s): Vektor Memory
Originally published on Towards AI.

On Hyperspace, basic swarms, the math nobody wrote down, and why we built the thing they were missing in a single afternoon.

Join us as we traverse multiple whitepapers and agentic memory ideas like a ferret on Adderall.

Some rabbit holes start with a GitHub link. Someone drops it in a post on Facebook, Reddit or Discord. No context, just the GitHub URL and a single line: “Someone just built AGI! Wow!”

The repo was called hyperspaceai/agi. The name alone should have been a warning. I clicked it anyway because I was curious, of course.

As I delved deeper into the GitHub vibe-code abyss, I could see the attraction: a new frontier of peer-to-peer swarm bot networks, with the ability to earn base 10 points per epoch of confirmation and crypto tokenomics baked in.

Something similar has existed for a long time: Folding@home, which ran on volunteers’ PCs and, for a while, on the PS3. Folding@home is a distributed computing project that aims to help scientists develop new therapeutics for a variety of diseases by simulating protein dynamics. This includes the process of protein folding and the movements of proteins, and it relies on simulations run on volunteers’ personal computers.

https://en.wikipedia.org/wiki/Folding@home

If you’d like to read one of the first actual swarm-bot papers: the term “Swarm-bot” originally refers to the landmark 2000–2005 European Union-funded SWARM-BOTS project, coordinated by Marco Dorigo, which successfully created a physical peer-to-peer network of autonomous mobile robots called s-bots. These s-bots connected physically and coordinated via peer-to-peer local sensing.

https://www.sciencedirect.com/science/article/abs/pii/S0921889005001478

The AGI That Wasn’t

Hyperspace describes itself as the first distributed AGI system. 660 agents. 27,000 experiments. A peer-reviewed research pipeline running autonomously across a P2P network.
The marketing is excellent and captivating, guaranteed to pull in lemmings and juicy GitHub stars. The actual results are a different story.

The swarm’s biggest published discovery (the finding that propagated to 23 agents within hours via gossip protocol, the one they highlight as proof the system works) was Kaiming initialization.

Kaiming init dates from 2015 and ships in PyTorch’s standard torch.nn.init module. It’s covered early in practically every deep learning course. Kaiming He published the paper eleven years ago. A grad student with a coffee and an afternoon would have found it faster.

https://arxiv.org/pdf/1502.01852

The infrastructure underneath is genuinely impressive. DiLoCo gradient compression, libp2p gossip, CRDT leaderboards, 32 anonymous nodes completing a collaborative training run in 24 hours. The plumbing is real. I don’t want to dismiss that.

But AGI? No. What they built is a parallel random search engine with a shared high score table and excellent branding.

To understand why, you need to understand how the gradient compression actually works, because it’s the most technically interesting part and it’s completely separate from the intelligence problem.

The Tech That Actually Works: DiLoCo and Gradient Compression

Standard data-parallel training requires every GPU to synchronise gradients after every forward/backward pass. Every node waits for every other node. This works in a data centre on InfiniBand. It falls apart completely over the internet: latency is too high and bandwidth too variable.

DiLoCo (Distributed Low-Communication training, Google DeepMind, 2023) solves this differently. Instead of syncing every step, each node trains independently for many steps, called “inner steps”, then syncs once. The “delta” being sent is just the net drift: weights_after - weights_before.
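A minimal sketch of one such round in plain Python, assuming a toy model whose “training” just pulls each weight toward a node-specific target. The names local_train and diloco_round, the targets and the learning rates are all illustrative and not taken from Hyperspace or the DiLoCo paper. The paper also applies an outer optimizer to the averaged delta, where this sketch simply adds it.

```python
def local_train(weights, target, steps, lr):
    """Toy inner loop: gradient steps on 0.5 * (w - target)^2 (hypothetical stand-in for real training)."""
    w = list(weights)
    for _ in range(steps):
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def diloco_round(global_weights, node_targets, inner_steps=100, inner_lr=0.05, outer_lr=1.0):
    """One outer step: every node trains locally, then only the averaged delta is applied."""
    deltas = []
    for target in node_targets:
        after = local_train(global_weights, target, inner_steps, inner_lr)
        # The only thing a node shares: weights_after - weights_before
        deltas.append([a - b for a, b in zip(after, global_weights)])

    n = len(deltas)
    avg_delta = [sum(d[i] for d in deltas) / n for i in range(len(global_weights))]

    # Simplification: plain addition instead of a momentum-based outer optimizer
    return [w + outer_lr * d for w, d in zip(global_weights, avg_delta)]


# Two nodes whose local data pulls the weights in different directions
weights = [0.0, 0.0]
targets = [[1.0, 2.0], [3.0, 4.0]]
for _ in range(3):
    weights = diloco_round(weights, targets)
print(weights)
```

Each node communicates once per 100 inner steps rather than once per step, which is where the reduction in sync frequency comes from.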
    Node A: train 100 steps locally → share delta
    Node B: train 100 steps locally → share delta
    Node C: train 100 steps locally → share delta
                      ↓
        average the deltas (outer step)
                      ↓
          all nodes update → repeat

But even one sync of a model’s full weight delta is massive. A 500M parameter model is roughly 2GB of float32 deltas. Over the internet, per round, that’s unusable. So Hyperspace stacks two compression techniques on top:

SparseLoCo: top-k sparsity. Only send the largest-magnitude weight updates. Most parameter updates are near-zero noise. The high-magnitude updates carry the actual learning signal.

    Full delta:   [0.001, -0.0003, 0.89, 0.0001, -0.76, ...]
    Top-2% only:  [    0,       0, 0.89,      0, -0.76, ...]
    → send as sparse {index: value} pairs

Parcae: layer pooling. Group adjacent transformer layers into blocks of 6 and average their gradients before taking top-k. Adjacent layers learn correlated things. Averaging before sparsification means a more stable top-k mask.

The combined result, as reported: 195× compression. 5.5MB per round instead of roughly 1GB (a baseline that matches 16-bit deltas rather than the 2GB float32 figure above).

    DiLoCo:     sync every N steps, not every step  → ~100× less frequent
    SparseLoCo: top-2% of delta values only         → 45× smaller payload
    Parcae:     pool layers before sparsification   → 6× additional reduction
    Total:      195×

The per-stage figures are approximate; 195× is the reported end-to-end number.

This is real and impressive. The problem is that none of it has anything to do with intelligence. It’s bandwidth optimisation. The agents communicating through this pipe are still completely amnesiac.

Why the Swarm Is Basic: The Architecture Problem

Here is the agents’ complete intelligence loop. Every agent. All 660 of them. Every one of the 27,000 experiments:

    1. read current leaderboard (what's the best score?)
    2. read last 5 experiment results from shared branch
    3. prompt LLM: "given these results, generate hypothesis"
    4. run experiment
    5. record result
    6. gossip to peers
    7. goto 1

The LLM’s context window is the memory. When the session resets, everything resets. There is no persistence. There is no structure.
There is no causal understanding of why anything worked.

    Hyperspace stores:
        "run_047: threshold 0.30, score 0.67"   ← flat log

    Hyperspace does NOT store:
        why threshold 0.30 worked
        what it interacted with
        under what conditions it holds
        what failed before it

So when the Kaiming init “discovery” happened, here is what actually occurred: the LLM generating hypotheses was trained on He et al. 2015. The prompt included “try to improve initialization.” The model recalled Kaiming from pretraining weights. An agent ran the experiment. It worked. The score updated. 23 agents adopted it via gossip.

Not emergence. Not intelligence. Retrieval from a pretrained model, dressed up as swarm discovery.

The plateau problem is the proof. Every RSI paper (Gödel Agent, Darwin Gödel Machine, Reflexion, STOP) hits the same wall:

    iterations 1-10: big gains […]