Product Info

Godel Base B‑1: A 200M-Parameter Sparse Mixture-of-Experts Language Model

At Atom Technologies, we are rethinking how large language models can be scaled efficiently while remaining accessible and performant across diverse applications. Godel Base B‑1 represents the first milestone in this journey: a mid-sized, open-source model designed to balance parameter count, inference speed, and modular architecture.

Model Overview

Godel Base B‑1 is a 200-million-parameter transformer language model that integrates a Mixture-of-Experts (MoE) routing layer to deliver greater capacity without linear growth in compute requirements.

This architecture is built from the ground up for:

  • Efficient inference under constrained hardware budgets

  • Modular scaling by adding more experts

  • Compatibility with modern tokenizers (like our T‑1) and training pipelines

Architecture Details

Godel Base B‑1 combines:

  • Dense transformer backbone
    A standard transformer encoder-decoder stack responsible for contextual representation learning.

  • Sparse MoE layers
    Selectively activated experts controlled by a gating network, allowing the model to dynamically specialize without incurring the full cost of dense layers.
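
As a rough illustration of how these two ingredients fit together, the sketch below shows a pre-LayerNorm transformer block whose feed-forward slot can hold either a dense FFN or a sparse MoE module. This is a minimal PyTorch-style sketch; the class and argument names are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """Pre-LayerNorm block: self-attention plus a pluggable feed-forward
        module, which can be a dense FFN or a sparse MoE layer."""

        def __init__(self, d_model: int, n_heads: int, feed_forward: nn.Module):
            super().__init__()
            self.attn_norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn_norm = nn.LayerNorm(d_model)
            self.feed_forward = feed_forward  # dense FFN or sparse MoE layer

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Pre-LN self-attention with a residual connection.
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            # Pre-LN feed-forward (dense or expert-routed) with a residual.
            return x + self.feed_forward(self.ffn_norm(x))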

Core Components

1. Embedding Layer

  • Vocabulary Size: 6,000 (T‑1 tokenizer)

  • Embedding Dimension: 512

2. Transformer Blocks

  • 24 transformer layers

  • Multi-head self-attention (8 attention heads per layer)

  • Pre-layer normalization

  • Feed-forward hidden dimension: 2048

3. MoE Layer

  • 16 experts in total

  • Sparse routing: Top-2 expert selection per token

  • Each expert: independent feed-forward network

  • Gating network: softmax routing probabilities that control expert activation

4. Output Projection

  • Linear projection to vocabulary logits

  • Softmax for final token probabilities
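
Collected in one place, the components above can be expressed as a small configuration object plus the embedding and output projection. B1Config and B1Skeleton are hypothetical names used only for illustration, and the 24 transformer blocks are elided.

    from dataclasses import dataclass

    import torch
    import torch.nn as nn

    @dataclass
    class B1Config:
        vocab_size: int = 6_000   # T-1 tokenizer vocabulary
        d_model: int = 512        # embedding dimension
        n_layers: int = 24        # transformer layers
        n_heads: int = 8          # attention heads per layer
        d_ffn: int = 2048         # feed-forward hidden dimension
        n_experts: int = 16       # experts per MoE layer
        top_k: int = 2            # experts activated per token

    class B1Skeleton(nn.Module):
        """Embedding -> transformer blocks (elided) -> output projection."""

        def __init__(self, cfg: B1Config):
            super().__init__()
            self.embed = nn.Embedding(cfg.vocab_size, cfg.d_model)
            self.final_norm = nn.LayerNorm(cfg.d_model)
            self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            x = self.embed(token_ids)                 # (batch, seq, d_model)
            # ... the 24 transformer blocks with MoE feed-forward layers go here ...
            logits = self.lm_head(self.final_norm(x))  # vocabulary logits
            return torch.softmax(logits, dim=-1)       # final token probabilities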

Mixture-of-Experts Design

Godel Base B‑1’s MoE implementation is inspired by approaches proven in research (Switch Transformer, DeepSeek-V3):

  • Sparse Activation: Only 2 out of 16 experts are active for each input token.

  • Load Balancing Loss: Auxiliary loss term encourages uniform utilization of experts.

  • Capacity Factor: Adjustable hyperparameter capping how many tokens each expert processes per batch

This enables a model with an effective capacity far larger than a dense 200M-parameter transformer, while maintaining efficient memory and compute usage.
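
The sketch below shows what top-2 softmax gating with a Switch-Transformer-style load-balancing loss can look like; the exact loss formulation, tensor layout, and class name (Top2Gate) are assumptions based on the description above rather than the B‑1 source.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top2Gate(nn.Module):
        """Routes each token to its top-2 experts and returns an auxiliary
        load-balancing loss that rewards uniform expert utilization."""

        def __init__(self, d_model: int, n_experts: int = 16, top_k: int = 2):
            super().__init__()
            self.n_experts = n_experts
            self.top_k = top_k
            self.gate = nn.Linear(d_model, n_experts, bias=False)

        def forward(self, x: torch.Tensor):
            # x: (tokens, d_model) -> softmax gating probabilities over experts.
            probs = F.softmax(self.gate(x), dim=-1)           # (tokens, n_experts)
            top_p, top_idx = probs.topk(self.top_k, dim=-1)   # 2 experts per token
            top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize the pair

            # Load-balancing term: fraction of tokens whose top choice is expert i
            # times the mean gate probability for expert i, scaled by n_experts.
            # It is minimized when routing is uniform across experts.
            dispatch = F.one_hot(top_idx[:, 0], self.n_experts).float()
            tokens_per_expert = dispatch.mean(dim=0)
            mean_probs = probs.mean(dim=0)
            aux_loss = self.n_experts * (tokens_per_expert * mean_probs).sum()

            # A capacity factor (not modeled here) would additionally cap how many
            # tokens each expert may accept; overflow tokens fall back to the
            # residual path.
            return top_idx, top_p, aux_loss

Because only two experts run per token, the per-token compute stays close to that of a single dense feed-forward layer even as total parameter capacity grows with the number of experts.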

Training Objectives

The model is pretrained on large-scale English and domain-specific corpora using:

  • Masked Language Modeling

  • Next Token Prediction

  • Auxiliary MoE load balancing regularization

Optimization uses AdamW with a cosine learning rate schedule and mixed-precision training for speed and stability.
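
In PyTorch terms, that setup looks roughly like the step below; the learning rate, weight decay, and schedule length are placeholder values rather than the published recipe, and the linear layer stands in for the full model.

    import torch
    import torch.nn.functional as F

    model = torch.nn.Linear(512, 6_000).cuda()   # stand-in for the full model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)
    scaler = torch.cuda.amp.GradScaler()         # mixed-precision loss scaling

    def train_step(inputs, targets):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():          # mixed-precision forward pass
            logits = model(inputs)
            loss = F.cross_entropy(logits, targets)
            # The auxiliary MoE load-balancing term would be added to loss here.
        scaler.scale(loss).backward()            # scaled backward for stability
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()                         # cosine learning-rate decay
        return loss.item()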

Performance Goals

Godel Base B‑1 is designed to deliver:

  • Competitive perplexity relative to larger dense models

  • Faster inference due to sparse expert activation

  • Robust handling of domain-specific text without specialized fine-tuning

While smaller than flagship multi-billion-parameter LLMs, B‑1 prioritizes deployability and adaptability.

Open-Source Roadmap

Godel Base B‑1 will be fully open-sourced, including:

  • Pretrained weights

  • Model architecture definitions (PyTorch)

  • Tokenizer artifacts

  • Training recipes and hyperparameters

Our goal is to make high-quality foundation models transparent, reproducible, and accessible to the broader ML community.

Use Cases

Potential applications include:

  • General-purpose text generation and summarization

  • Task-oriented dialog systems

  • Code completion and lightweight programming assistance

  • Research and experimentation with sparse expert models

Conclusion

Godel Base B‑1 takes a new approach to scaling mid-sized language models, combining a sparse Mixture-of-Experts design with a streamlined transformer backbone and an open-source philosophy. This architecture establishes a flexible foundation for future models and extensions, including larger parameter counts and more specialized expert modules.

Stay tuned for model releases, technical papers, and integration guides.