BIIC: Bio-Inspired Information Cell
A geometric algebra framework for lossless information representation in language models
📐 Cl(4,1) Conformal GA ✅ Phase 1 Complete ✅ Phase 2 Complete 🔄 Phase 3 Running 📋 Phase 4 Planned

🔴 The Problem with Tokens

Every current language model compresses all semantics into a single flat vector — and that vector gets overwritten layer by layer. There is no mechanism to distinguish what a token originally meant from what inference added to it. The consequences are structural, not incidental.

💥 Irreversible semantic loss: deep layers overwrite original semantics; residual connections are engineering patches, not mathematical guarantees.
🌊 Information overload: residual connections only add, never subtract, so irrelevant intermediate states accumulate indefinitely.
📉 Long-context degradation: there is no active forgetting mechanism; the KV cache grows linearly and quality systematically degrades as sequence length grows.

💡 Key Insight: Learn from DNA

DNA simultaneously achieves three things that tokens cannot: permanent genome preservation, dynamic epigenetic read/write, and active erasure of outdated marks. We map each mechanism directly to a mathematical structure.

🧬 DNA Architecture

Genome (immutable): permanent identity, never overwritten.
Epigenome (read/write): dynamic context, updated with cell state.
TET demethylase: active erasure of outdated methylation marks.

⚡ BIIC Architecture

Grade-0 (invariant core): algebraically invariant under any sandwich product; a theorem, not a heuristic.
Grade-1~4 (equivariant): evolves with the inference context and carries reasoning state.
GradeAwareEraser: controlled decay on the equivariant grades only; grade-0 change = 0.0 (exact).

The critical property: in Cl(4,1) conformal geometric algebra, the grade-0 scalar is algebraically invariant under the sandwich product RMR̃ for any rotor R. This is not an approximation — it is a theorem. We use this structure to separate what a token is from what inference knows about it.
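
This is straightforward to check numerically: by the cyclic property of the grade-0 projection, ⟨RMR̃⟩₀ = ⟨MR̃R⟩₀ = ⟨M⟩₀ whenever RR̃ = 1. The sketch below verifies it with the third-party clifford package purely for illustration; this repo's reference implementation is src/clifford_cl41.py.

python · Grade-0 invariance check (illustrative, third-party clifford package)
# Sanity check of grade-0 invariance under R M ~R. Uses the third-party
# `clifford` package (pip install clifford) for brevity; this repo's
# authoritative implementation is src/clifford_cl41.py.
import numpy as np
import clifford as cf

layout, blades = cf.Cl(4, 1)      # Cl(4,1): 2^5 = 32 basis blades
rng = np.random.default_rng(0)

M = cf.MultiVector(layout, rng.normal(size=layout.gaDims))            # random multivector
B = cf.MultiVector(layout, rng.normal(size=layout.gaDims))(2) * 0.1   # small random bivector
R = np.e ** B                     # rotor = exp(bivector), so R * ~R == 1

M_out = R * M * ~R                # sandwich product
print("grade-0 before:", M(0).value[0])
print("grade-0 after: ", M_out(0).value[0])   # equal up to float rounding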

📊 Results

Phase 1 — Mathematical Verification ✅

Fig 1. Grade-0 error after 100 consecutive sandwich product transformations stays at the 10⁻⁶ level across 3 independent seeds (threshold: 10⁻⁴).

Grade-0 invariance error (100 transforms, 3 seeds): 6.56×10⁻⁶ ± 4.95×10⁻⁶ · threshold: 10⁻⁴
Multi-channel leakage (C=8 channels): 0.0 · exact zero; channels are fully independent
Eraser effect on grade-0: 0.0 · grade-0 unchanged after 50 consecutive erase ops (see the sketch below)
Gradient flow ratio (10-layer chain): 0.55 · healthy range: 0.1 – 10
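
The eraser result above is structural rather than numerical: decay is applied only to coefficients of grades 1–4, so the grade-0 coefficient is copied through bit-exactly. Below is a minimal sketch of that mechanism, assuming the standard Cl(4,1) grade dimensions (1, 5, 10, 10, 5, 1); the function name, decay rate, and zero prior are illustrative, and the repo's real implementation is src/eraser_ops.py.

python · GradeAwareEraser sketch (illustrative)
import torch

GRADE_DIMS = (1, 5, 10, 10, 5, 1)                     # grades 0..5 of Cl(4,1)
GRADE_OF = torch.repeat_interleave(torch.arange(6), torch.tensor(GRADE_DIMS))

def erase(mv: torch.Tensor, prior: torch.Tensor, rate: float = 0.1) -> torch.Tensor:
    # Decay only the equivariant coefficients (grades 1-4) toward the prior;
    # grade-0 gets a decay factor of exactly 0, so it passes through bit-exactly.
    decay = torch.where((GRADE_OF >= 1) & (GRADE_OF <= 4),
                        torch.tensor(rate), torch.tensor(0.0))
    return mv + decay * (prior - mv)

mv, prior = torch.randn(32), torch.zeros(32)
out = mv
for _ in range(50):                                   # 50 consecutive erase ops
    out = erase(out, prior)

print("grade-0 change:", (out[0] - mv[0]).abs().item())            # exactly 0.0
print("grade-2 norm:", mv[6:16].norm().item(), "->", out[6:16].norm().item())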

Phase 2 — Encoding-Decoding Pipeline ✅

Fig 2. AllGradeDecoder overfitting test: loss converges from 7.08 to 0.012, a 99.8% improvement (3 seeds, std < 0.0001).
Fig 3. Grade L2 norms after 300 training steps: grade-2 is most active (10D, relational information), consistent with theory. Separation emerges without explicit supervision.

All-grade vs. grade-0-only decoding: 5.3× (0.006 vs. 0.032) · the equivariant grades carry real information
Grade-0 change after 6 inference layers: 0.0 · exact zero, confirmed under end-to-end training

Fig 4. Average cosine similarity between different tokens' grade-0 representations: 0.029 ± 0.013 (near-orthogonal). Grade-0 strongly discriminates token identity.

Phase 4 Dry Run — Architecture Validated ✅

Parameters (n_channels=8): 10M · full pipeline: encoder + SlowFast + DualCodebook
Peak VRAM (n_channels=8): 295 MB · fixed size; does not grow with sequence length (see the channel-independence sketch below)
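
The fixed footprint rests in part on the channel-independence result from Phase 1: when every operation acts channel-wise, perturbing one of the n_channels=8 channels cannot change any other channel's output. Below is a toy illustration of that test logic; the tanh op is a stand-in, not an actual BIIC layer, and the repo's real tests live in tests/.

python · Channel-independence check (illustrative)
import torch

C, DIM = 8, 32                               # 8 channels of Cl(4,1) multivectors

def channelwise_op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for any per-channel BIIC op: it acts on each channel
    # independently, so no information can cross channels.
    return torch.tanh(x) * 0.9

x = torch.randn(C, DIM)
x_pert = x.clone()
x_pert[3] += torch.randn(DIM)                # perturb channel 3 only

delta = channelwise_op(x_pert) - channelwise_op(x)
leakage = delta[torch.arange(C) != 3].abs().max()
print("cross-channel leakage:", leakage.item())      # exactly 0.0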

🗺 Roadmap

Phase · Goal · Status
Phase 1 · Mathematical verification of Cl(4,1) properties (invariance, equivariance, Eraser) · ✅ Complete
Phase 2 · Encoding-decoding pipeline: TokenToIC → BIICLayer → AllGradeDecoder · ✅ Complete
Phase 3 · 6-group controlled experiment (H1: geometry vs. orthogonality; H2: equivariant structure vs. dimensionality; H3: Eraser on long sequences) · 🔄 Running
Phase 4 · MVP language model: SlowFastBIIC + DualCodebook, WikiText-103, no residual / no KV cache · 📋 Planned

⚡ What This Enables

🔒 Lossless long-context: grade-0 preserves original semantics regardless of inference depth; algebraically guaranteed, not approximated.
💾 No KV cache: mutable state replaces key-value storage; the memory footprint is fixed and does not grow with sequence length.
🔍 Built-in interpretability: grade decomposition separates "what the token is" from "what inference knows"; each grade has a distinct role.
🌐 Natural multimodal alignment: different modalities map into the same algebra; grade-0 cores are directly comparable across text, image, and audio, with no extra alignment training.
📐 O(L) complexity: the SlowFast architecture eliminates quadratic attention; the slow network updates every K steps while the fast network reads every step (see the sketch after this list).
🧹 Active forgetting: GradeAwareEraser decays the equivariant grades toward a semantic prior, keeping information entropy bounded over long inference.
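
A minimal sketch of the SlowFast schedule from the O(L) item above: a fixed-size state replaces the KV cache, the slow network writes it every K steps, and the fast network reads it at every step. The module choices (GRUCell, Linear) and all shapes are illustrative assumptions, not this repo's SlowFastBIIC classes.

python · SlowFast schedule sketch (illustrative)
import torch
import torch.nn as nn

D, K = 256, 4                                # state width, slow-update period
slow = nn.GRUCell(D, D)                      # stand-in for the slow network
fast = nn.Linear(2 * D, D)                   # stand-in for the fast readout

state = torch.zeros(1, D)                    # fixed-size mutable state
xs = torch.randn(128, 1, D)                  # a length-128 input sequence

outputs = []
for t, x in enumerate(xs):                   # one pass over the sequence: O(L)
    if t % K == 0:
        state = slow(x, state)               # slow write, every K steps
    outputs.append(fast(torch.cat([x, state], dim=-1)))  # fast read, every step

print(len(outputs), state.shape)             # 128 steps; state stays (1, 256)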

📍 How We Differ

BIIC is not a new transformer variant. It replaces the information carrier itself — the token — before any transformer-style processing occurs.

Approach · What they replace · Input/output · Invariance guarantee
GATr (2023) · Attention mechanism · Still tokens · E(3) equivariance only
Versor (2026) · Internal computation · Still tokens · SE(3) equivariance only
FoldToken (2024) · Protein structure tokens · Domain-specific · SE(3)-invariant encoder
BIIC (ours) · The information carrier itself · Multivector (invariant + equivariant) · Grade-0 invariant by theorem; Eraser preserves the invariant core exactly

🚀 Quick Start

bash · Phase 1–2 verification (CPU)
pip install torch numpy scipy matplotlib

# Phase 1: Mathematical verification (CPU, ~2 min)
python tests/test_phase1.py

# Phase 2: Pipeline verification (CPU, ~10 min)
python tests/test_decoder_basic.py
python tests/test_encoder.py
python tests/test_full_pipeline.py

📁 Repository Structure

BIIC/
├── src/
│   ├── clifford_cl41.py      # Cl(4,1) Golden Reference (never delete)
│   ├── rotor_utils.py        # Rotors & sandwich products
│   ├── eraser_ops.py         # GradeAwareEraser
│   ├── token_to_ic.py        # TokenToImmutableCore encoder
│   ├── all_grade_decoder.py  # DualCodebook decoder
│   ├── mutable_state.py      # BIICLayer (Writer + Eraser)
│   └── biic_loss.py          # Annealed auxiliary losses
├── tests/                    # 10 + 11 validation tests (Phase 1 + Phase 2)
├── results/                  # JSON data, 3 seeds each phase
├── figures/                  # Paper figures (fig1–fig4)
└── LICENSE

📚 References

Brehmer et al. (2023). Geometric Algebra Transformer (GATr). NeurIPS 2023.
Huy & Hirst (2026). Versor: A Geometric Sequence Architecture.
Ji (2026). CliffordNet: All You Need is Geometric Algebra.
Anonymous (2026). Toward a Functional Geometric Algebra for NLP.
Dasgupta et al. (2026). Invariant Features in Language Models.
Wu & Zhang (2017). TET-mediated active DNA demethylation. Nature Reviews Genetics.

💬 Collaboration

🧬
Val Huang
Independent Researcher · BIIC Author
💬 WeChat: llmbbs

Interested in collaborating on the paper, contributing experiments, or exploring new information-theoretic paradigms? Reach out.

📄 Citation

@misc{huang2026biic,
  title = {Bio-Inspired Information Cell: A Geometric Algebra Framework for Lossless Information Representation in Language Models},
  author = {Huang, Zhongchang},
  year = {2026},
  note = {Phase 1–2 complete, Phase 3–4 ongoing}
}

📜 License

Business Source License 1.1 — free for non-production and research use. See LICENSE for details.
