Nexora Proto v0.1: Foundational Build of a Compact NLP Stack for Resource-Constrained Devices

Ron Asnahon, November 2023

Abstract

This paper details the initial development of Nexora Proto v0.1, COREA Starstroupe’s foundational natural language processing (NLP) stack tailored for resource-constrained environments. Released internally in November 2023, Nexora introduces a compact model with 0.89 million parameters, a frequency-scaled tokenization pipeline, and a loss-scaling optimizer. The architecture prioritizes minimal memory footprints and efficient embeddings to enable NLP on microcontrollers and low-power ARM processors. We present the design philosophy, tokenization system, training regime, quantization experiments, embedding stabilization techniques, and early inference benchmarks, laying the groundwork for future advancements in compression, token prioritization, and instruction tuning.

1. Design Philosophy

Nexora is engineered to operate efficiently on hardware with severe computational and memory constraints, such as microcontrollers and single-core ARM processors. Unlike conventional language models optimized for GPU clusters, Nexora scales downward to fit within a 10MB RAM ceiling while maintaining usable linguistic capabilities. The initial phase focused on three core objectives:

Developing the smallest viable NLP model with robust linguistic priors, capable of basic comprehension and generation tasks.
Implementing a deterministic tokenizer with entropy regulation to ensure stable and compact token representations.
Creating a context-aware embedding system to capture semantic relationships with minimal drift in low-resource settings.

These goals align with COREA Starstroupe’s non-profit mission to democratize AI through open-source, lightweight solutions for edge devices.

2. Initial Model Architecture

Nexora Proto v0.1 is a lightweight transformer-based model designed for balance between computational efficiency and linguistic performance:

Model: Nexora Proto v0.1
Parameters: 0.89M (890,000)
Hidden Size: 96
Heads: 3 (multi-head attention)
Layers: 4 transformer blocks
Token Limit: 128 tokens
Positional Encoding: Fixed sinusoidal (non-learnable)

The architecture was selected after extensive experimentation to optimize convergence speed on a subsampled dataset while adhering to a memory footprint of ≤10MB. Each transformer block includes layer normalization, multi-head attention, and a feed-forward network with residual connections. The sinusoidal positional encoding ensures deterministic context awareness without additional parameters, critical for low-memory devices.

3. Tokenization System

Nexora employs a frequency-scaled BytePair Encoding (BPE) variant with entropy clamping to maintain stable token representations. Token entropy H_t is computed over batches to regulate representation diversity:

H_t = -Σ p(t_i) * log₂(p(t_i))

Where:

p(t_i): Probability of token t_i in the batch
H_t: Entropy in bits

Solution: For a batch with tokens [t₁, t₂, t₃] and probabilities p = [0.5, 0.3, 0.2]:

H_t = -(0.5 * log₂(0.5) + 0.3 * log₂(0.3) + 0.2 * log₂(0.2))

= -(0.5 * -1 + 0.3 * -1.737 + 0.2 * -2.322) = 0.5 + 0.5211 + 0.4644 ≈ 1.4855 bits

If H_t exceeds 3.5 bits, low-probability token IDs (p(t_i) < 0.05) are culled for that batch to prevent overfitting to rare tokens. Initial tokenizer vocabulary size is 3,100 tokens, optimized for a multilingual subsampled corpus.

4. Training Regime

4.1 Dataset and Preprocessing

Training utilized a subsampled multilingual corpus:

Size: 3.1M sequences (pre-cleaned)
Languages: English, Spanish, French (balanced distribution)
Preprocessing: Lowercasing, punctuation normalization, stop-word retention for context

The corpus was curated to ensure diversity in sentence length (mean: 12 tokens, max: 128 tokens) and semantic complexity, suitable for low-resource NLP tasks.

4.2 Loss Function

A smoothed cross-entropy loss with an adaptive floor was used to stabilize training:

L = -Σ [y_i * log(p_i)] + λ * max(0, L_floor - L_current)

Where:

y_i: True label (one-hot)
p_i: Predicted probability
λ: Smoothing factor (0.1)
L_floor: Adaptive loss floor (initially 1.5, adjusted by 0.05 per epoch)

Solution: For a batch with true labels y = [1, 0, 0], predictions p = [0.7, 0.2, 0.1], L_floor = 1.5, L_current = -log(0.7) ≈ 0.3567:

L = -log(0.7) + 0.1 * max(0, 1.5 - 0.3567) ≈ 0.3567 + 0.1 * 1.1433 ≈ 0.4710

4.3 Optimization

Training parameters:

Optimizer: AdamW (β₁=0.9, β₂=0.98, ε=1e-8)
Learning Rate: Linear warm-up over 1,600 steps to 2e-4, cosine decay to 1e-6
Batch Size: 64
Sequence Length: 128
Epochs: 12
Hardware: Single NVIDIA Jetson Nano (4GB RAM, 128-core Maxwell GPU)

4.4 Loss Convergence

Observed loss convergence (average over validation set):

Step 0: 4.92
Step 2,000: 2.78
Step 8,000: 1.92

Convergence was achieved at approximately 10,000 steps, with a final validation loss of 1.85, indicating stable learning despite the model’s small size.

5. Quantization Study (Preliminary)

Post-training quantization experiments were conducted to assess model compression feasibility:

INT8 Quantization: Reduced model size from 3.56MB (float32) to 0.89MB but caused 10–13% degradation in top-1 token accuracy (from 84.1% to 73.2–74.9%).
Binary Quantization: Collapsed core embeddings, resulting in >50% accuracy loss, rendering the model unusable.

Decision: Float32 weights were retained for production to preserve linguistic fidelity, pending further quantization research.

6. Embedding Dynamics

Embedding drift, a common challenge in small models, was monitored using the L2-norm difference between embeddings at consecutive training steps:

D(t) = ||e_i(t) - e_i(t-1)||₂

Where:

e_i(t): Embedding of token i at step t
D(t): Drift magnitude

Solution: For embeddings e_i(t) = [0.3, 0.4], e_i(t-1) = [0.2, 0.35]:

D(t) = sqrt((0.3-0.2)² + (0.4-0.35)²) = sqrt(0.01 + 0.0025) = sqrt(0.0125) ≈ 0.1118

Mean drift at 10,000 steps: 0.118 (L2-norm), indicating moderate instability.

To mitigate drift, a regularization penalty was introduced:

L_reg = μ * Σ ||e_i(t) - e_i(t-1)||₂

Where μ = 0.01. This reduced mean drift to 0.092 by step 12,000, improving embedding stability.

7. Inference Metrics (Early Build)

Benchmarked on Raspberry Pi Zero 4 W (64MB RAM, 1GHz quad-core Cortex-A53):

Metric	Baseline
Inference Latency	64.2 ms
Peak RAM Usage	48.3 MB
Token Accuracy Top-5	85.1%

Inference was tested on a 128-token test set with basic conversational prompts. Latency was dominated by memory access (70%) due to the Pi’s limited DRAM bandwidth. Top-5 accuracy reflects the model’s ability to predict relevant tokens in low-context scenarios.

8. Observations and Limitations

Key challenges identified in the November 2024 build:

Token Entropy Fluctuations High entropy (H_t > 4 bits) in some batches led to unstable embeddings for rare tokens, necessitating manual re-weighting of high-frequency tokens.
Embedding Collapse High-frequency tokens occasionally converged to similar vectors, reducing model expressivity. This was partially mitigated by increasing λ in the loss function to 0.15.
Layer Normalization Dynamics Layer norm parameters (γ, β) exhibited oscillatory behavior at batch restarts, causing temporary loss spikes (e.g., +0.3 at epoch 5). A smaller learning rate (1e-5) for norm parameters resolved this.
Static Positional Encoding Fixed sinusoidal encodings limited the model’s ability to generalize to variable-length sequences, reducing contextual reusability in multi-sentence inputs.

9. Conclusion

Nexora Proto v0.1 establishes a robust foundation for compact NLP on resource-constrained devices, aligning with COREA Starstroupe’s mission to deliver open-source AI solutions. Despite its modest capabilities, the model demonstrates viable performance on microcontrollers, with a stable training pipeline and manageable memory footprint. Future work will address limitations through dynamic token prioritization, advanced quantization techniques, and instruction-tuned adaptations to enhance robustness and versatility.

References

Corea STARSTROUPE Internal Documentation. (2023). Nexora Proto v0.1 Design Notes.
Corea STARSTROUPE
Sennrich, R., et al. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909.
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
Wu, Y., et al. (2016). Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. arXiv preprint arXiv:1609.06161.