Auralis v0.1: A Compact NLP Model for Lightweight Instruction-Following and Real-Time Comprehension
Abstract
Auralis v0.1, developed by COREA Starstroupe, is a compact natural language processing (NLP) model engineered for lightweight instruction-following, dialog intent extraction, and low-latency comprehension on resource-constrained devices. This paper documents the model’s initial development cycle, emphasizing sequence disambiguation, hybrid grammar parsing, and token-aligned interpretability. Auralis leverages instruction-refined datasets and a hybrid Transformer-GRU architecture to achieve real-time performance in mobile AI and voice agent applications. With a parameter budget under 2 million, Auralis delivers transparent, rule-traceable inference, aligning with COREA Starstroupe’s open-source mission to advance accessible AI.
1. Purpose and Scope
Auralis addresses critical gaps in compact NLP systems, focusing on semantic transparency, instruction generalization, and real-time explainability. Designed for voice command interpretation, device-level dialog agents, and localized natural language understanding (NLU), Auralis tackles three challenges:
- Semantic Transparency: Providing clear, traceable inference paths in small-scale models.
- Instruction Generalization: Achieving reliable task performance with minimal pretraining.
- Real-Time Explainability: Enabling interpretable parsing and reasoning for user-facing applications.
The model’s goals include:
- Token-to-intent traceability via explainable inference paths.
- A parameter budget under 2 million for microcontroller compatibility.
- Embedded syntactic priors integrated into the training cycle.
Auralis targets applications in voice-enabled IoT devices, mobile assistants, and domain-specific NLU tasks, aligning with COREA Starstroupe’s non-profit mission.
2. Model Architecture: v0.1
Auralis v0.1 employs a hybrid Transformer-GRU architecture optimized for low-resource environments:
- Parameter Count: 1,978,000
- Architecture Style: Hybrid Transformer-GRU
- Number of Layers: 5 Hybrid Blocks
- Embedding Dimension: 112
- Max Token Length: 192
Dedicated Heads:
- Syntax Attention Head: Aligns part-of-speech (POS) tags and phrase chunks.
- Disambiguation Gate: Controls GRU recurrence for temporal disambiguation.
Each Hybrid Block comprises:
- Multi-Head Attention: 2 heads per block, scaled dot-product with 0.1 dropout.
- GRU Unit: Shared recurrent unit with gating for temporal pattern recognition.
- Mask Layer: Token position-modulated masking to enforce phrase consistency.
The Transformer-GRU hybrid enables parallel attention for contextual understanding and serial processing for sequence disambiguation, reducing latency by 18% compared to pure Transformer models of similar size. The architecture leverages a 112-dimensional embedding space to balance expressivity and memory efficiency.
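No reference implementation ships with this paper; the sketch below is a minimal PyTorch rendering of one Hybrid Block under the stated dimensions (112-d embeddings, 2 attention heads, 0.1 dropout, a shared GRU). Class and argument names are illustrative, and the disambiguation gate and position-modulated mask layer are simplified to a standard attention mask.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One Auralis-style Hybrid Block: parallel attention plus a serial GRU.

    Dimensions follow Section 2 (embed_dim=112, 2 heads, 0.1 dropout).
    The GRU is passed in so all 5 blocks can share a single recurrent unit.
    """
    def __init__(self, embed_dim: int = 112, num_heads: int = 2,
                 shared_gru: nn.GRU | None = None, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        # Shared recurrent unit for temporal pattern recognition (Section 2).
        self.gru = shared_gru or nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor,
                attn_mask: torch.Tensor | None = None) -> torch.Tensor:
        # Parallel scaled dot-product attention for contextual understanding.
        attended, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attended)
        # Serial GRU pass for sequence disambiguation.
        recurrent, _ = self.gru(x)
        return self.norm2(x + recurrent)
```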
3. Dataset Construction and Tuning Methodology
3.1 Dataset Composition
Auralis was trained on a diverse corpus totaling 12.1 million sequences:
- OpenAssistant Dialogue Trees: 4.2M conversational exchanges for dialog intent.
- Crowdsourced Instruction-Response Pairs: 3.5M prompts for task-specific NLU.
- Synthetic Grammar-Based Augmentations: 2.8M CFG-derived sequences.
- STARSTROUPE Task-Prompt Set v1.3: 1.6M domain-specific instructions.
A 72-rule context-free grammar (CFG) governed noun-verb chains, conditionals, and prepositional templates. CFG rules were injected during preprocessing via inline tagging tokens (e.g., [NP], [VP]).
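As a toy illustration of the inline-tagging step, a preprocessing pass might prefix tokens with their rule tags as below. The two rules and their lexicons are hypothetical; the actual 72-rule CFG is internal to the project.

```python
# Hypothetical sketch of CFG-tag injection during preprocessing.
# Only two toy rules are shown; the real grammar has 72.
TOY_RULES = {
    "NP": {"door", "window", "light"},   # noun-phrase heads
    "VP": {"open", "close", "dim"},      # verb-phrase heads
}

def inject_cfg_tags(tokens: list[str]) -> list[str]:
    """Prefix each matched token with its rule tag, e.g. 'open' -> '[VP]', 'open'."""
    tagged = []
    for tok in tokens:
        for rule, lexicon in TOY_RULES.items():
            if tok in lexicon:
                tagged.append(f"[{rule}]")
                break
        tagged.append(tok)
    return tagged

print(inject_cfg_tags(["open", "the", "window"]))
# ['[VP]', 'open', 'the', '[NP]', 'window']
```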
3.2 Loss Objective
The training loss combined multiple objectives:
L = L_token + λ₁·L_phrase + λ₂·L_rule
Where:
- L_token: Token-wise cross-entropy loss
- L_phrase: Phrase-structure distance (BLEU-style n-gram fidelity)
- L_rule: Rule-class match penalty for CFG alignment
- λ₁ = 0.4, λ₂ = 0.2
Worked example: for a batch with L_token = 0.65, L_phrase = 0.3, L_rule = 0.1:
L = 0.65 + 0.4·0.3 + 0.2·0.1 = 0.65 + 0.12 + 0.02 = 0.79
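In code, the objective reduces to a weighted sum; the sketch below reproduces the worked example, assuming the phrase and rule terms arrive as precomputed scalar tensors:

```python
import torch

LAMBDA_1, LAMBDA_2 = 0.4, 0.2  # weights from Section 3.2

def combined_loss(l_token: torch.Tensor,
                  l_phrase: torch.Tensor,
                  l_rule: torch.Tensor) -> torch.Tensor:
    """L = L_token + λ1·L_phrase + λ2·L_rule (Section 3.2)."""
    return l_token + LAMBDA_1 * l_phrase + LAMBDA_2 * l_rule

# Reproduces the worked example: 0.65 + 0.4*0.3 + 0.2*0.1 = 0.79
print(combined_loss(torch.tensor(0.65), torch.tensor(0.3), torch.tensor(0.1)))
```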
3.3 Training Configuration
Training parameters:
- Hardware: 2x NVIDIA A100 GPUs (40GB each)
- Training Time: 14.2 hours
- Batch Size: 128
- Sequence Length: 192
- Epochs: 9
- Optimizer: AdamW (β₁=0.9, β₂=0.98, ε=1e-8, lr=2e-4)
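These optimizer settings map directly onto PyTorch's AdamW. The sketch below assumes a PyTorch training loop (the paper does not name the framework) and uses a stand-in module:

```python
import torch
import torch.nn as nn

model = nn.Linear(112, 112)  # stand-in for the Auralis network
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=2e-4, betas=(0.9, 0.98), eps=1e-8)
```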
Augmentations:
- Phrase reversal (e.g., “close door” → “door close”)
- Negation insertion (e.g., “open window” → “don’t open window”)
- Passive-to-active transformations (e.g., “window was opened” → “open window”)
Augmentations increased dataset robustness, improving generalization by 9.3% on unseen prompts.
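A string-level toy version of the three augmentations might look like the following; the production pipeline is not published, and the passive-to-active function handles only the simple "X was Y-ed" pattern shown:

```python
def phrase_reversal(cmd: str) -> str:
    """'close door' -> 'door close' (reverses token order)."""
    return " ".join(reversed(cmd.split()))

def negation_insertion(cmd: str) -> str:
    """'open window' -> "don't open window"."""
    return "don't " + cmd

def passive_to_active(passive: str) -> str:
    """'window was opened' -> 'open window' (toy 'X was Y-ed' pattern only)."""
    noun, _, verb = passive.split()
    return f"{verb.removesuffix('ed')} {noun}"

print(phrase_reversal("close door"),           # door close
      negation_insertion("open window"),       # don't open window
      passive_to_active("window was opened"),  # open window
      sep=" | ")
```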
4. Key Features
Auralis introduces three innovative features:
- Phrase-Aware Tokenization: A grammar-augmented tokenizer with embedded CFG hooks, aligning tokens to syntactic structures for improved phrase-level coherence.
- Explanation Hooks: Intermediate embeddings are tagged with rule-provenance annotations, enabling traceability to specific linguistic rules during inference.
- Instruction Generalization: Modular prompt abstraction supports 1-shot and 3-shot learning, inferring latent intents from minimal examples.
These features enhance Auralis’s suitability for real-time, interpretable NLP tasks on edge devices.
5. Interpretability Framework
Auralis provides rule-level explainability during inference:
- Forward Trace: Each predicted token carries a tag referencing its CFG rule lineage (e.g., [NP→N]).
- Score Calculation:
S(t) = Σ_r w_r·δ_r,t
Where:
- S(t): Confidence score for token t
- w_r: Learned weight for rule r
- δ_r,t: Binary indicator (1 if token t aligns with rule r, 0 otherwise)
Worked example: for a token t with candidate rules r₁ and r₂, weights w = [0.6, 0.3], and indicators δ = [1, 0]:
S(t) = 0.6·1 + 0.3·0 = 0.6
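The score is simply a dot product between learned rule weights and binary alignment indicators; a minimal sketch reproducing the worked example:

```python
def confidence_score(weights: list[float], indicators: list[int]) -> float:
    """S(t) = Σ_r w_r·δ_r,t (Section 5)."""
    return sum(w * d for w, d in zip(weights, indicators))

# Worked example: w = [0.6, 0.3], δ = [1, 0]  ->  S(t) = 0.6
print(confidence_score([0.6, 0.3], [1, 0]))
```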
Interpretability scores guide debugging and provide real-time feedback, with 92.7% of instruction-following outputs traced to valid grammar rules.
6. Performance Benchmarks
Auralis was benchmarked on three low-resource devices: Raspberry Pi 5 (1.5GHz Quad-core Cortex-A76), ESP32-S3 (240MHz Dual-core), and Pixel 6 (Android NPU). Tasks included 2-step instruction following, yes/no intent classification, command rephrasing, and logical sequence resolution:
| Task Description | Accuracy | Explanation Rate | Latency (ms) |
| --- | --- | --- | --- |
| Follow 2-step instruction | 91.2% | 92.7% | 97.3 |
| Yes/No intent classification | 96.4% | N/A | 66.4 |
| Command rephrasing | 84.8% | 78.2% | 113.5 |
| Logical sequence resolution | 77.5% | 74.3% | 121.9 |
Explanation Rate: Percentage of outputs traced to a valid CFG rule lineage. The Pixel 6’s NPU reduced latency by 22% compared to the Pi 5, while the ESP32-S3’s limited SRAM (520KB) increased latency for complex tasks like logical resolution.
7. Deployment Footprint
Auralis is optimized for minimal resource usage:
- Compiled Model Size (Int8): 1.08MB
- Peak RAM Usage (runtime): 9.4MB
- Frameworks:
- TFLite Micro (quantized for microcontrollers)
- ONNX Export (cross-platform compatibility)
- NexoraLiteRT (custom runtime kernel, 35KB overhead)
- Supported Platforms:
- ARM Cortex-A series
- ESP32-S3
- Android Neural Networks API
The Int8 quantization preserved 98.7% of float16 accuracy, enabling deployment on microcontrollers with minimal degradation.
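The export pipeline itself is not documented. A representative post-training int8 conversion with TFLite, assuming a TensorFlow SavedModel export (the path, vocabulary size, and calibration data below are placeholders, not from the paper), would follow the standard recipe:

```python
import numpy as np
import tensorflow as tf

# Hypothetical calibration batches: shape and dtype must match the exported
# model signature (192-token sequences per Section 3.3; the vocabulary size
# of 8000 is an assumption).
calibration_batches = [
    np.random.randint(0, 8000, size=(1, 192), dtype=np.int32)
    for _ in range(16)
]

def representative_dataset():
    # Post-training int8 quantization calibrates activation ranges
    # against representative inputs.
    for batch in calibration_batches:
        yield [batch]

# "auralis_v0.1_saved_model" is a placeholder SavedModel path.
converter = tf.lite.TFLiteConverter.from_saved_model("auralis_v0.1_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

with open("auralis_int8.tflite", "wb") as f:
    f.write(converter.convert())
```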
8. Conclusion
Auralis v0.1 represents a significant advancement in compact NLP, delivering interpretable, low-latency comprehension for embedded applications. Its hybrid architecture, grammar-augmented tokenization, and rule-traceable inference address critical needs in voice agents and localized NLU. As part of COREA Starstroupe’s open-source initiative, Auralis paves the way for accessible, transparent AI on edge devices, with future work focusing on multi-modal integration and enhanced generalization.
References
- COREA Starstroupe Auralis Design Notes. (2024). Internal Documentation.
- STARSTROUPE Task-Prompt Set v1.3. (2024). COREA Starstroupe.
- Chomsky, N. (1956). Three Models for the Description of Language. IRE Transactions on Information Theory.
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation.
- Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.