Auralis v0.1: A Compact NLP Model for Lightweight Instruction-Following and Real-Time Comprehension
Abstract
Auralis v0.1, developed by COREA Starstroupe, is a compact natural language processing (NLP) model engineered for lightweight instruction-following, dialog intent extraction, and low-latency comprehension on resource-constrained devices. This paper documents the model’s initial development cycle, emphasizing sequence disambiguation, hybrid grammar parsing, and token-aligned interpretability. Auralis leverages instruction-refined datasets and a hybrid Transformer-GRU architecture to achieve real-time performance in mobile AI and voice agent applications. With a parameter budget under 2 million, Auralis delivers transparent, rule-traceable inference, aligning with COREA Starstroupe’s open-source mission to advance accessible AI.
1. Purpose and Scope
Auralis addresses critical gaps in compact NLP systems, focusing on semantic transparency, instruction generalization, and real-time explainability. Designed for voice command interpretation, device-level dialog agents, and localized natural language understanding (NLU), Auralis tackles three challenges:
- Semantic Transparency: Providing clear, traceable inference paths in small-scale models.
- Instruction Generalization: Achieving reliable task performance with minimal pretraining.
- Real-Time Explainability: Enabling interpretable parsing and reasoning for user-facing applications.
The model’s goals include:
- Token-to-intent traceability via explainable inference paths.
- A parameter budget under 2 million for microcontroller compatibility.
- Embedded syntactic priors integrated into the training cycle.
Auralis targets applications in voice-enabled IoT devices, mobile assistants, and domain-specific NLU tasks, aligning with COREA Starstroupe’s non-profit mission.
2. Model Architecture: v0.1
Auralis v0.1 employs a hybrid Transformer-GRU architecture optimized for low-resource environments:
- Parameter Count: 1,978,000
- Architecture Style: Hybrid Transformer-GRU
- Number of Layers: 5 Hybrid Blocks
- Embedding Dimension: 112
- Max Token Length: 192
Dedicated Heads:
- Syntax Attention Head: Aligns part-of-speech (POS) tags and phrase chunks.
- Disambiguation Gate: Controls GRU recurrence for temporal disambiguation.
Each Hybrid Block comprises:
- Multi-Head Attention: 2 heads per block, scaled dot-product with 0.1 dropout.
- GRU Unit: Shared recurrent unit with gating for temporal pattern recognition.
- Mask Layer: Token position-modulated masking to enforce phrase consistency.
The Transformer-GRU hybrid enables parallel attention for contextual understanding and serial processing for sequence disambiguation, reducing latency by 18% compared to pure Transformer models of similar size. The architecture leverages a 112-dimensional embedding space to balance expressivity and memory efficiency.
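No reference implementation ships with this paper; the sketch below is a minimal PyTorch rendering of one Hybrid Block under the stated dimensions (112-d embeddings, 2 attention heads, 0.1 dropout, a shared GRU). Class and argument names are illustrative, and the disambiguation gate and position-modulated mask layer are simplified to a standard attention mask.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One Auralis-style Hybrid Block: parallel attention plus a serial GRU.

    Dimensions follow Section 2 (embed_dim=112, 2 heads, 0.1 dropout).
    The GRU is passed in so all 5 blocks can share a single recurrent unit.
    """
    def __init__(self, embed_dim: int = 112, num_heads: int = 2,
                 shared_gru: nn.GRU | None = None, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        # Shared recurrent unit for temporal pattern recognition (Section 2).
        self.gru = shared_gru or nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor,
                attn_mask: torch.Tensor | None = None) -> torch.Tensor:
        # Parallel scaled dot-product attention for contextual understanding.
        attended, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attended)
        # Serial GRU pass for sequence disambiguation.
        recurrent, _ = self.gru(x)
        return self.norm2(x + recurrent)
```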
3. Dataset Construction and Tuning Methodology
3.1 Dataset Composition
Auralis was trained on a diverse corpus totaling 12.1 million sequences:
- OpenAssistant Dialogue Trees: 4.2M conversational exchanges for dialog intent.
- Crowdsourced Instruction-Response Pairs: 3.5M prompts for task-specific NLU.
- Synthetic Grammar-Based Augmentations: 2.8M CFG-derived sequences.
- STARSTROUPE Task-Prompt Set v1.3: 1.6M domain-specific instructions.
A 72-rule context-free grammar (CFG) governed noun-verb chains, conditionals, and prepositional templates. CFG rules were injected during preprocessing via inline tagging tokens (e.g., [NP], [VP]).
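As a toy illustration of the inline-tagging step, a preprocessing pass might prefix tokens with their rule tags as below. The two rules and their lexicons are hypothetical; the actual 72-rule CFG is internal to the project.

```python
# Hypothetical sketch of CFG-tag injection during preprocessing.
# Only two toy rules are shown; the real grammar has 72.
TOY_RULES = {
    "NP": {"door", "window", "light"},   # noun-phrase heads
    "VP": {"open", "close", "dim"},      # verb-phrase heads
}

def inject_cfg_tags(tokens: list[str]) -> list[str]:
    """Prefix each matched token with its rule tag, e.g. 'open' -> '[VP]', 'open'."""
    tagged = []
    for tok in tokens:
        for rule, lexicon in TOY_RULES.items():
            if tok in lexicon:
                tagged.append(f"[{rule}]")
                break
        tagged.append(tok)
    return tagged

print(inject_cfg_tags(["open", "the", "window"]))
# ['[VP]', 'open', 'the', '[NP]', 'window']
```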
3.2 Loss Objective
The training loss combined multiple objectives:
L = L_token + λ₁·L_phrase + λ₂·L_rule
Where:
- L_token: Token-wise cross-entropy loss
- L_phrase: Phrase-structure distance (BLEU-style n-gram fidelity)
- L_rule: Rule-class match penalty for CFG alignment
- λ₁ = 0.4, λ₂ = 0.2
Worked example: for a batch with L_token = 0.65, L_phrase = 0.3, L_rule = 0.1:
L = 0.65 + 0.4·0.3 + 0.2·0.1 = 0.65 + 0.12 + 0.02 = 0.79
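In code, the objective reduces to a weighted sum; the sketch below reproduces the worked example, assuming the phrase and rule terms arrive as precomputed scalar tensors:

```python
import torch

LAMBDA_1, LAMBDA_2 = 0.4, 0.2  # weights from Section 3.2

def combined_loss(l_token: torch.Tensor,
                  l_phrase: torch.Tensor,
                  l_rule: torch.Tensor) -> torch.Tensor:
    """L = L_token + λ1·L_phrase + λ2·L_rule (Section 3.2)."""
    return l_token + LAMBDA_1 * l_phrase + LAMBDA_2 * l_rule

# Reproduces the worked example: 0.65 + 0.4*0.3 + 0.2*0.1 = 0.79
print(combined_loss(torch.tensor(0.65), torch.tensor(0.3), torch.tensor(0.1)))
```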
3.3 Training Configuration
Training parameters:
- Hardware: 2x NVIDIA A100 GPUs (40GB each)
- Training Time: 14.2 hours
- Batch Size: 128
- Sequence Length: 192
- Epochs: 9
- Optimizer: AdamW (β₁=0.9, β₂=0.98, ε=1e-8, lr=2e-4)
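These optimizer settings map directly onto PyTorch's AdamW. The sketch below assumes a PyTorch training loop (the paper does not name the framework) and uses a stand-in module:

```python
import torch
import torch.nn as nn

model = nn.Linear(112, 112)  # stand-in for the Auralis network
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=2e-4, betas=(0.9, 0.98), eps=1e-8)
```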
Augmentations:
- Phrase reversal (e.g., “close door” → “door close”)
- Negation insertion (e.g., “open window” → “don’t open window”)
- Passive-to-active transformations (e.g., “window was opened” → “open window”)
Augmentations increased dataset robustness, improving generalization by 9.3% on unseen prompts.
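A string-level toy version of the three augmentations might look like the following; the production pipeline is not published, and the passive-to-active function handles only the simple "X was Y-ed" pattern shown:

```python
def phrase_reversal(cmd: str) -> str:
    """'close door' -> 'door close' (reverses token order)."""
    return " ".join(reversed(cmd.split()))

def negation_insertion(cmd: str) -> str:
    """'open window' -> "don't open window"."""
    return "don't " + cmd

def passive_to_active(passive: str) -> str:
    """'window was opened' -> 'open window' (toy 'X was Y-ed' pattern only)."""
    noun, _, verb = passive.split()
    return f"{verb.removesuffix('ed')} {noun}"

print(phrase_reversal("close door"),           # door close
      negation_insertion("open window"),       # don't open window
      passive_to_active("window was opened"),  # open window
      sep=" | ")
```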
4. Key Features
Auralis introduces three innovative features:
- Phrase-Aware Tokenization: A grammar-augmented tokenizer with embedded CFG hooks, aligning tokens to syntactic structures for improved phrase-level coherence.
- Explanation Hooks: Intermediate embeddings are tagged with rule-provenance annotations, enabling traceability to specific linguistic rules during inference.
- Instruction Generalization: Modular prompt abstraction supports 1-shot and 3-shot learning, inferring latent intents from minimal examples.
These features enhance Auralis’s suitability for real-time, interpretable NLP tasks on edge devices.
5. Interpretability Framework
Auralis provides rule-level explainability during inference:
- Forward Trace: Each predicted token carries a tag referencing its CFG rule lineage (e.g., [NP→N]).
- Score Calculation:
S(t) = Σ_r w_r·δ_r,t
Where:
- S(t): Confidence score for token t
- w_r: Learned weight for rule r
- δ_r,t: Binary indicator (1 if token t aligns with rule r, 0 otherwise)
Worked example: for a token t with candidate rules r₁ and r₂, weights w = [0.6, 0.3], and indicators δ = [1, 0]:
S(t) = 0.6·1 + 0.3·0 = 0.6
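The score is simply a dot product between learned rule weights and binary alignment indicators; a minimal sketch reproducing the worked example:

```python
def confidence_score(weights: list[float], indicators: list[int]) -> float:
    """S(t) = Σ_r w_r·δ_r,t (Section 5)."""
    return sum(w * d for w, d in zip(weights, indicators))

# Worked example: w = [0.6, 0.3], δ = [1, 0]  ->  S(t) = 0.6
print(confidence_score([0.6, 0.3], [1, 0]))
```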
Interpretability scores guide debugging and provide real-time feedback, with 92.7% of instruction-following outputs traced to valid grammar rules.
6. Performance Benchmarks
Auralis was benchmarked on three low-resource devices: Raspberry Pi 5 (1.5GHz Quad-core Cortex-A76), ESP32-S3 (240MHz Dual-core), and Pixel 6 (Android NPU). Tasks included 2-step instruction following, yes/no intent classification, command rephrasing, and logical sequence resolution:
| Task Description | Accuracy | Explanation Rate | Latency (ms) |
| --- | --- | --- | --- |
| Follow 2-step instruction | 91.2% | 92.7% | 97.3 |
| Yes/No intent classification | 96.4% | N/A | 66.4 |
| Command rephrasing | 84.8% | 78.2% | 113.5 |
| Logical sequence resolution | 77.5% | 74.3% | 121.9 |
Explanation Rate: Percentage of outputs traced to a valid CFG rule lineage. The Pixel 6’s NPU reduced latency by 22% compared to the Pi 5, while the ESP32-S3’s limited SRAM (520KB) increased latency for complex tasks like logical resolution.
7. Deployment Footprint
Auralis is optimized for minimal resource usage:
- Compiled Model Size (Int8): 1.08MB
- Peak RAM Usage (runtime): 9.4MB
- Frameworks:
- TFLite Micro (quantized for microcontrollers)
- ONNX Export (cross-platform compatibility)
- NexoraLiteRT (custom runtime kernel, 35KB overhead)
- Supported Platforms:
- ARM Cortex-A series
- ESP32-S3
- Android Neural Networks API
The Int8 quantization preserved 98.7% of float16 accuracy, enabling deployment on microcontrollers with minimal degradation.
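The export pipeline itself is not documented. A representative post-training int8 conversion with TFLite, assuming a TensorFlow SavedModel export (the path, vocabulary size, and calibration data below are placeholders, not from the paper), would follow the standard recipe:

```python
import numpy as np
import tensorflow as tf

# Hypothetical calibration batches: shape and dtype must match the exported
# model signature (192-token sequences per Section 3.3; the vocabulary size
# of 8000 is an assumption).
calibration_batches = [
    np.random.randint(0, 8000, size=(1, 192), dtype=np.int32)
    for _ in range(16)
]

def representative_dataset():
    # Post-training int8 quantization calibrates activation ranges
    # against representative inputs.
    for batch in calibration_batches:
        yield [batch]

# "auralis_v0.1_saved_model" is a placeholder SavedModel path.
converter = tf.lite.TFLiteConverter.from_saved_model("auralis_v0.1_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

with open("auralis_int8.tflite", "wb") as f:
    f.write(converter.convert())
```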
8. Conclusion
Auralis v0.1 represents a significant advancement in compact NLP, delivering interpretable, low-latency comprehension for embedded applications. Its hybrid architecture, grammar-augmented tokenization, and rule-traceable inference address critical needs in voice agents and localized NLU. As part of COREA Starstroupe’s open-source initiative, Auralis paves the way for accessible, transparent AI on edge devices, with future work focusing on multi-modal integration and enhanced generalization.
References
- COREA Starstroupe Auralis Design Notes. (2024). Internal Documentation.
- STARSTROUPE Task-Prompt Set v1.3. (2024). COREA Starstroupe.
- Chomsky, N. (1956). Three Models for the Description of Language. IRE Transactions on Information Theory.
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation.
- Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.