NeuroLite-4M: A Compact Small Language Model for Real-Time Edge Processing
Abstract
This paper details the foundational research behind Nexora AI’s initial prototype: a compact, energy-efficient small language model (SLM) designed for real-time edge processing and scalable intent detection. Built upon lightweight transformer variants and quantized inference strategies, Nexora introduces a minimal architecture named NeuroLite-4M, capable of high-throughput semantic inference with just 4 million parameters. The paper includes architectural blueprints, early benchmarks, and theoretical justifications for low-complexity natural language understanding.
1. Introduction
Large-scale NLP models often demand significant computational resources, making them impractical for edge devices or privacy-sensitive applications. Nexora AI, a project under COREA Starstroupe’s open-source initiative, has developed NeuroLite-4M, a small language model (SLM) optimized for speed, efficiency, and cognitive compression. Designed for intent detection, query classification, and semantic entity extraction, NeuroLite-4M operates with minimal resources, supporting COREA Starstroupe’s mission to advance accessible human-machine interaction.
2. Model Architecture Overview
2.1 Base Configuration
NeuroLite-4M is defined by the following configuration (sketched in code after this list):
- Parameter Count: 4 million
- Architecture Type: Transformer Lite (6 layers, 8 heads, embedding dim = 128)
- Token Limit: 256
- Vocabulary: 16,384 BPE tokens
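For concreteness, the published hyperparameters can be collected in a configuration object. This is a minimal sketch in PyTorch-style Python; the name NeuroLiteConfig and the use of a dataclass are illustrative, not from the released implementation:

```python
from dataclasses import dataclass

@dataclass
class NeuroLiteConfig:
    """Hypothetical container for the published NeuroLite-4M hyperparameters."""
    n_layers: int = 6          # transformer blocks
    n_heads: int = 8           # attention heads per block
    d_model: int = 128         # embedding dimension
    max_seq_len: int = 256     # token limit
    vocab_size: int = 16_384   # BPE vocabulary size
    d_ff: int = 512            # feed-forward inner dimension (Section 2.2)
    dropout: float = 0.1       # dropout after self-attention (Section 2.2)
```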
2.2 Layer Composition
Each transformer block consists of:
- LayerNorm → Self-Attention (multi-head) → Dropout (p=0.1)
- LayerNorm → Feed Forward (128→512→128)
The attention mechanism employs Linformer-style key/value compression, projecting the sequence axis down to a fixed length k so that attention cost falls from O(n²) to O(nk) ≈ O(n), where n is the sequence length, as sketched below.
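A minimal PyTorch sketch of one block's attention under these settings; the compressed key/value length k = 64 is an assumption (the paper does not state it), and all class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Multi-head self-attention with Linformer-style key/value compression.

    Keys and values are projected from sequence length n down to a fixed
    length k, so attention cost scales as O(n*k) rather than O(n^2).
    Dimensions follow NeuroLite-4M (d_model=128, 8 heads, max length 256).
    """
    def __init__(self, d_model=128, n_heads=8, max_len=256, k=64, p_drop=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learned projections that compress the sequence axis (max_len -> k).
        self.E = nn.Parameter(torch.randn(max_len, k) / max_len ** 0.5)
        self.F = nn.Parameter(torch.randn(max_len, k) / max_len ** 0.5)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                       # x: (batch, n, d_model)
        b, n, d = x.shape
        q, k_, v = self.qkv(x).chunk(3, dim=-1)
        # Compress keys/values along the sequence axis before attention.
        k_ = torch.einsum("bnd,nk->bkd", k_, self.E[:n])
        v = torch.einsum("bnd,nk->bkd", v, self.F[:n])
        def split(t):                           # -> (batch, heads, len, d_head)
            return t.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, k_, v = split(q), split(k_), split(v)
        att = torch.softmax(q @ k_.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (att @ v).transpose(1, 2).reshape(b, n, d)
        return self.drop(self.out(y))
```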
3. Quantization Strategy
Weights are quantized to 8-bit integers using symmetric per-tensor quantization for deployment on low-power CPUs and microcontrollers.
3.1 Quantization Function
Q(w) = round(w / s) * s, s = max(|w|) / 127
Quantization is applied post-training; scaling by the maximum absolute weight covers the full tensor range while keeping rounding distortion low. It reduced model size from 14.5 MB to 3.8 MB and improved inference speed by 42% on an ARM Cortex-A72.
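A NumPy sketch of this scheme, assuming signed INT8 storage over the symmetric range [-127, 127]:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization with s = max(|w|) / 127."""
    s = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return q, s

def dequantize(q: np.ndarray, s: float) -> np.ndarray:
    """Recover the approximation Q(w) = round(w / s) * s."""
    return q.astype(np.float32) * s

w = np.random.randn(128, 512).astype(np.float32)
q, s = quantize_symmetric(w)
print(np.abs(dequantize(q, s) - w).max())   # worst-case rounding error ~ s/2
```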
4. Training Regime
4.1 Corpus Composition
The training corpus comprises:
- 2.1 million short English queries
- Domains: finance, weather, education, social, informal Q&A
- Tokenized with SentencePiece BPE
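Tokenizer construction might look as follows with the SentencePiece library; the file names are placeholders, and only the BPE model type and the 16,384-token vocabulary come from the paper:

```python
import sentencepiece as spm

# Train a 16,384-token BPE vocabulary (hypothetical file names).
spm.SentencePieceTrainer.train(
    input="queries.txt",           # one short query per line
    model_prefix="neurolite_bpe",
    vocab_size=16_384,
    model_type="bpe",
)

# Load the model and tokenize a sample query.
sp = spm.SentencePieceProcessor(model_file="neurolite_bpe.model")
print(sp.encode("what is the weather in seoul", out_type=int))
```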
4.2 Objective
Masked Language Modeling (MLM) with 15% token masking per sequence.
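A sketch of the masking step, assuming a dedicated [MASK] token id in the vocabulary. The paper specifies only the 15% rate; whether selected tokens are sometimes kept or randomly replaced (BERT's 80/10/10 rule) is not stated, so this version masks every selected position:

```python
import torch

def mask_for_mlm(ids: torch.Tensor, mask_id: int, p: float = 0.15):
    """Mask 15% of positions for MLM and build labels.

    Labels are -100 at unmasked positions so cross-entropy ignores them.
    """
    selected = torch.rand(ids.shape) < p
    labels = ids.masked_fill(~selected, -100)
    masked_ids = ids.masked_fill(selected, mask_id)
    return masked_ids, labels
```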
4.3 Optimization
Training parameters:
- Optimizer: AdamW
- Batch size: 512
- Learning rate: 3e-4 with cosine decay
- Training time: 48 hours on 1 A100 GPU (shared)
Loss converged after 120k steps. The MLM objective minimizes the negative log-likelihood of the masked tokens:
L = -Σ(i ∈ M) log P(wi | w̃)
where M is the set of masked positions and w̃ is the input sequence with those positions replaced by the mask token.
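A minimal sketch of this setup, assuming PyTorch; `model` is a placeholder for the NeuroLite-4M network, and annealing the cosine schedule over the 120k convergence steps is an assumption:

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(128, 128)   # placeholder for the NeuroLite-4M network
optimizer = AdamW(model.parameters(), lr=3e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=120_000)   # cosine decay
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # scores masked positions only

# Per step: loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1));
# loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```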
5. Intent Detection via Latent Embedding
Final token embeddings hL are extracted and mean-pooled:
hpool = (1/n) Σ hL,i
A 3-layer intent classifier is then applied (see the sketch after this list):
- Dense (128→64) + ReLU
- Dense (64→16) + ReLU
- Dense (16→5 intents) + Softmax
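Combining the pooling step with this head, a minimal PyTorch sketch (the class name IntentHead is illustrative):

```python
import torch
import torch.nn as nn

class IntentHead(nn.Module):
    """Mean-pool final-layer token embeddings, then apply the 3-layer classifier."""
    def __init__(self, d_model=128, n_intents=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, n_intents),
        )

    def forward(self, h_last):               # h_last: (batch, n, d_model)
        h_pool = h_last.mean(dim=1)          # hpool = (1/n) Σ hL,i
        return torch.softmax(self.mlp(h_pool), dim=-1)
```

The five output indices would correspond to the intent classes listed at the end of this section.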
Early benchmarks:

| Task | Accuracy |
|---|---|
| Binary Intent Match | 91.2% |
| Multi-class Intent ID | 84.5% |
| Semantic Clustering | 78.3% |
Intent classes: query, command, question, affirmation, cancellation.
6. Energy Efficiency Analysis
6.1 Compute Profile (per 100 inferences)
Performance on Raspberry Pi 4:
- Runtime: 0.98 s total (≈9.8 ms per inference)
- Peak RAM: 73MB
- Energy Draw: ~0.11 Wh
Compared to a distilled BERT baseline:
- 3.2× faster
- 4.7× less energy
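The paper does not describe its measurement harness; the following sketch shows one way such runtime and peak-memory figures could be collected on a Linux device such as the Pi 4 (energy draw would require an external power meter):

```python
import time
import resource
import torch

def profile(model, batch, n_runs=100):
    """Time n_runs forward passes and report peak resident memory.

    Hypothetical harness; ru_maxrss is reported in kilobytes on Linux.
    """
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_runs):
            model(batch)
        elapsed = time.perf_counter() - start
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return elapsed, peak_mb
```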
7. Conclusion and Next Steps
NeuroLite-4M, developed under COREA Starstroupe’s Nexora AI, demonstrates the efficacy of small language models in constrained environments, supporting ambient intent recognition and privacy-first applications. Future work in 2024 will focus on multilingual support, context memory, and task-specific fine-tuning, advancing COREA Starstroupe’s open-source mission.
Appendix: Layer Weight Distribution (Truncated)
| Layer | Mean Weight | Std Dev | Max Weight | Min Weight |
|---|---|---|---|---|
| L1 | 0.034 | 0.122 | 0.98 | -0.91 |
| L4 | 0.029 | 0.087 | 0.76 | -0.66 |
| L6 | 0.031 | 0.113 | 0.82 | -0.73 |
References
- COREA Starstroupe. (2023). Nexora AI Specifications: Internal Model Architecture Sheets.
- COREA Starstroupe. (2023). Nexora AI Pretraining Logs, Build NL-4M-Alpha.
- Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
- Wu, Y., et al. (2019). Quantization for Efficient Inference of Deep Neural Networks. arXiv preprint arXiv:1910.05488.