NeuroLite-4M: A Compact Small Language Model for Real-Time Edge Processing
Abstract
This paper details the foundational research behind Nexora AI’s initial prototype: a compact, energy-efficient small language model (SLM) designed for real-time edge processing and scalable intent detection. Built upon lightweight transformer variants and quantized inference strategies, Nexora introduces a minimal architecture named NeuroLite-4M, capable of high-throughput semantic inference with just 4 million parameters. The paper includes architectural blueprints, early benchmarks, and theoretical justifications for low-complexity natural language understanding.
1. Introduction
Large-scale NLP models often demand significant computational resources, making them impractical for edge devices or privacy-sensitive applications. Nexora AI, a project under COREA Starstroupe’s open-source initiative, has developed NeuroLite-4M, a small language model (SLM) optimized for speed, efficiency, and cognitive compression. Designed for intent detection, query classification, and semantic entity extraction, NeuroLite-4M operates with minimal resources, supporting COREA Starstroupe’s mission to advance accessible human-machine interaction.
2. Model Architecture Overview
2.1 Base Configuration
NeuroLite-4M is defined by the following configuration (sketched in code after this list):
- Parameter Count: 4 million
- Architecture Type: Transformer Lite (6 layers, 8 heads, embedding dim = 128)
- Token Limit: 256
- Vocabulary: 16,384 BPE tokens
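For concreteness, the published hyperparameters can be collected in a configuration object. This is a minimal sketch in PyTorch-style Python; the name NeuroLiteConfig and the use of a dataclass are illustrative, not from the released implementation:

```python
from dataclasses import dataclass

@dataclass
class NeuroLiteConfig:
    """Hypothetical container for the published NeuroLite-4M hyperparameters."""
    n_layers: int = 6          # transformer blocks
    n_heads: int = 8           # attention heads per block
    d_model: int = 128         # embedding dimension
    max_seq_len: int = 256     # token limit
    vocab_size: int = 16_384   # BPE vocabulary size
    d_ff: int = 512            # feed-forward inner dimension (Section 2.2)
    dropout: float = 0.1       # dropout after self-attention (Section 2.2)
```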
2.2 Layer Composition
Each transformer block consists of:
- LayerNorm → Self-Attention (multi-head) → Dropout (p=0.1)
- LayerNorm → Feed Forward (128→512→128)
The attention mechanism employs Linformer-style key/value compression, projecting the sequence axis down to a fixed length k so that attention cost falls from O(n²) to O(nk) ≈ O(n), where n is the sequence length, as sketched below.
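A minimal PyTorch sketch of one block's attention under these settings; the compressed key/value length k = 64 is an assumption (the paper does not state it), and all class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Multi-head self-attention with Linformer-style key/value compression.

    Keys and values are projected from sequence length n down to a fixed
    length k, so attention cost scales as O(n*k) rather than O(n^2).
    Dimensions follow NeuroLite-4M (d_model=128, 8 heads, max length 256).
    """
    def __init__(self, d_model=128, n_heads=8, max_len=256, k=64, p_drop=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learned projections that compress the sequence axis (max_len -> k).
        self.E = nn.Parameter(torch.randn(max_len, k) / max_len ** 0.5)
        self.F = nn.Parameter(torch.randn(max_len, k) / max_len ** 0.5)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                       # x: (batch, n, d_model)
        b, n, d = x.shape
        q, k_, v = self.qkv(x).chunk(3, dim=-1)
        # Compress keys/values along the sequence axis before attention.
        k_ = torch.einsum("bnd,nk->bkd", k_, self.E[:n])
        v = torch.einsum("bnd,nk->bkd", v, self.F[:n])
        def split(t):                           # -> (batch, heads, len, d_head)
            return t.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, k_, v = split(q), split(k_), split(v)
        att = torch.softmax(q @ k_.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (att @ v).transpose(1, 2).reshape(b, n, d)
        return self.drop(self.out(y))
```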
3. Quantization Strategy
Weights are quantized to 8-bit integers using symmetric per-tensor quantization for deployment on low-power CPUs and microcontrollers.
3.1 Quantization Function
Q(w) = round(w / s) * s, s = max(|w|) / 127
Quantization is applied post-training; scaling by the maximum absolute weight covers the full tensor range while keeping rounding distortion low. It reduced model size from 14.5 MB to 3.8 MB and improved inference speed by 42% on an ARM Cortex-A72.
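A NumPy sketch of this scheme, assuming signed INT8 storage over the symmetric range [-127, 127]:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization with s = max(|w|) / 127."""
    s = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return q, s

def dequantize(q: np.ndarray, s: float) -> np.ndarray:
    """Recover the approximation Q(w) = round(w / s) * s."""
    return q.astype(np.float32) * s

w = np.random.randn(128, 512).astype(np.float32)
q, s = quantize_symmetric(w)
print(np.abs(dequantize(q, s) - w).max())   # worst-case rounding error ~ s/2
```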
4. Training Regime
4.1 Corpus Composition
The training corpus comprises:
- 2.1 million short English queries
- Domains: finance, weather, education, social, informal Q&A
- Tokenized with SentencePiece BPE
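Tokenizer construction might look as follows with the SentencePiece library; the file names are placeholders, and only the BPE model type and the 16,384-token vocabulary come from the paper:

```python
import sentencepiece as spm

# Train a 16,384-token BPE vocabulary (hypothetical file names).
spm.SentencePieceTrainer.train(
    input="queries.txt",           # one short query per line
    model_prefix="neurolite_bpe",
    vocab_size=16_384,
    model_type="bpe",
)

# Load the model and tokenize a sample query.
sp = spm.SentencePieceProcessor(model_file="neurolite_bpe.model")
print(sp.encode("what is the weather in seoul", out_type=int))
```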
4.2 Objective
Masked Language Modeling (MLM) with 15% token masking per sequence.
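A sketch of the masking step, assuming a dedicated [MASK] token id in the vocabulary. The paper specifies only the 15% rate; whether selected tokens are sometimes kept or randomly replaced (BERT's 80/10/10 rule) is not stated, so this version masks every selected position:

```python
import torch

def mask_for_mlm(ids: torch.Tensor, mask_id: int, p: float = 0.15):
    """Mask 15% of positions for MLM and build labels.

    Labels are -100 at unmasked positions so cross-entropy ignores them.
    """
    selected = torch.rand(ids.shape) < p
    labels = ids.masked_fill(~selected, -100)
    masked_ids = ids.masked_fill(selected, mask_id)
    return masked_ids, labels
```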
4.3 Optimization
Training parameters:
- Optimizer: AdamW
- Batch size: 512
- Learning rate: 3e-4 with cosine decay
- Training time: 48 hours on 1 A100 GPU (shared)
Loss converged after 120k steps. The MLM objective minimizes the negative log-likelihood of the masked tokens:
L = -Σ(i ∈ M) log P(wi | w̃)
where M is the set of masked positions and w̃ is the input sequence with those positions replaced by the mask token.
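A minimal sketch of this setup, assuming PyTorch; `model` is a placeholder for the NeuroLite-4M network, and annealing the cosine schedule over the 120k convergence steps is an assumption:

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(128, 128)   # placeholder for the NeuroLite-4M network
optimizer = AdamW(model.parameters(), lr=3e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=120_000)   # cosine decay
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # scores masked positions only

# Per step: loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1));
# loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```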
5. Intent Detection via Latent Embedding
Final token embeddings hL are extracted and mean-pooled:
hpool = (1/n) Σ hL,i
A 3-layer intent classifier is then applied (see the sketch after this list):
- Dense (128→64) + ReLU
- Dense (64→16) + ReLU
- Dense (16→5 intents) + Softmax
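Combining the pooling step with this head, a minimal PyTorch sketch (the class name IntentHead is illustrative):

```python
import torch
import torch.nn as nn

class IntentHead(nn.Module):
    """Mean-pool final-layer token embeddings, then apply the 3-layer classifier."""
    def __init__(self, d_model=128, n_intents=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, n_intents),
        )

    def forward(self, h_last):               # h_last: (batch, n, d_model)
        h_pool = h_last.mean(dim=1)          # hpool = (1/n) Σ hL,i
        return torch.softmax(self.mlp(h_pool), dim=-1)
```

The five output indices would correspond to the intent classes listed at the end of this section.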
Early benchmarks:

| Task | Accuracy |
|---|---|
| Binary Intent Match | 91.2% |
| Multi-class Intent ID | 84.5% |
| Semantic Clustering | 78.3% |
Intent classes: query, command, question, affirmation, cancellation.
6. Energy Efficiency Analysis
6.1 Compute Profile (per 100 inferences)
Performance on Raspberry Pi 4:
- Runtime: 0.98 s total (≈9.8 ms per inference)
- Peak RAM: 73MB
- Energy Draw: ~0.11 Wh
Compared to a distilled BERT baseline:
- 3.2× faster
- 4.7× less energy
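The paper does not describe its measurement harness; the following sketch shows one way such runtime and peak-memory figures could be collected on a Linux device such as the Pi 4 (energy draw would require an external power meter):

```python
import time
import resource
import torch

def profile(model, batch, n_runs=100):
    """Time n_runs forward passes and report peak resident memory.

    Hypothetical harness; ru_maxrss is reported in kilobytes on Linux.
    """
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_runs):
            model(batch)
        elapsed = time.perf_counter() - start
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return elapsed, peak_mb
```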
7. Conclusion and Next Steps
NeuroLite-4M, developed under COREA Starstroupe’s Nexora AI, demonstrates the efficacy of small language models in constrained environments, supporting ambient intent recognition and privacy-first applications. Future work in 2024 will focus on multilingual support, context memory, and task-specific fine-tuning, advancing COREA Starstroupe’s open-source mission.
Appendix: Layer Weight Distribution (Truncated)
| Layer | Mean Weight | Std Dev | Max Weight | Min Weight |
|---|---|---|---|---|
| L1 | 0.034 | 0.122 | 0.98 | -0.91 |
| L4 | 0.029 | 0.087 | 0.76 | -0.66 |
| L6 | 0.031 | 0.113 | 0.82 | -0.73 |
References
- COREA Starstroupe. (2023). Nexora AI Specifications: Internal Model Architecture Sheets.
- COREA Starstroupe. (2023). Nexora AI Pretraining Logs, Build NL-4M-Alpha.
- Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
- Wu, Y., et al. (2019). Quantization for Efficient Inference of Deep Neural Networks. arXiv preprint arXiv:1910.05488.