An Interactive Affective Computing Framework Utilizing NoSQL Architecture and Explainable Large Language Models for Real-Time Cognitive State Tracking

Ruziev Xamrokul; Davlatova Navbahor; Daminov Bobosher; Abdurasulov Anvar; Murodjonova Yulduz; Ganiyeva  Inobat

doi:10.21070/acopen.11.2026.14667

Ruziev Xamrokul ⁽¹⁾, Davlatova Navbahor ⁽²⁾, Daminov Bobosher ⁽³⁾, Abdurasulov Anvar ⁽⁴⁾, Murodjonova Yulduz ⁽⁵⁾, Ganiyeva Inobat ⁽⁶⁾

(1) University of Economics and Pedagogy, Uzbekistan, Uzbekistan

(2) Hardware and Software Support of Computer Systems, Karshi State Technical University, Uzbekistan

(3) Postal Communication Systems, Karshi State Technical University, Uzbekistan

(4) Information Technology Service, Karshi State Technical University, Uzbekistan

(5) Karshi State Technical University, Uzbekistan

(6) Computer Engineering, Karshi State Technical University, Uzbekistan

Fulltext View | Download View | Download

Abstract:

General Background Modern software engineering systems are currently shifting from traditional command-driven interfaces toward emotionally intelligent human-computer interaction frameworks. Specific Background Within this domain, tracking facial expressions via multi-dimensional symbolic blendshape weight vectors represents the primary modality for identifying continuous human affect without the massive storage overhead associated with raw multimedia files. Knowledge Gap However, conventional configurations frequently experience severe transactional write-amplification bottlenecks in legacy relational storage tiers alongside semantic parsing constraints and black-box opacity within generative artificial intelligence components. Aims To resolve these structural performance limitations, this paper introduces a novel, decoupled interactive affective computing framework that isolates real-time telemetry from heavy back-end computational processing. Results Empirically, the deployment of a specialized Document-Oriented Affective Database Schema within a non-relational environment maintains an ultra-low database write latency between 3.2 ms and 8.9 ms under concurrent high-throughput workloads, while the implementation of a Contextual Glyph Generator slashes language model context ingestion overhead by 74.2%. Furthermore, the framework achieves an expert-verified F1-score of 0.941 across complex emotional sub-clusters with an end-to-end multi-turn runtime latency of 112.5 ms, sitting comfortably beneath the critical 150 ms human-cognitive interactivity threshold. Novelty This architecture introduces a unique symbolic translation pathway that programmatically converts raw numerical coordinate streams into discrete structural textual tokens, ensuring that downstream transformer processes run with zero hallucination anomalies. Implications These findings demonstrate that explanation-driven transformer algorithms can be reliably deployed as transparent, accountable, and stable control layers within high-stakes computer engineering and behavioral monitoring environments.

Keywords: Affective Computing, Explainable Artificial Intelligence, NoSQL Database Architecture, Real-Time Cognitive Tracking, Human-Computer Interaction

Key Findings Highlights
Document-oriented database configurations achieve an ultra-low write latency profile under concurrent high-throughput stream workloads.
Contextual glyph generation reduces downstream large language model context ingestion overhead by over seventy percent.
Explanation-driven validation protocols completely eliminate text generation hallucination anomalies during multi-turn interactive dialogue states.

Downloads

Download data is not yet available.

I. INTRODUCTION

Today's computer engineering and software systems are entering an age of change from the previous command-driven interaction to one that is intuitive and emotionally intelligent—human-computer interaction (HCI). The field of affective computing has become an important one that seeks to unravel the emotions of human beings and allow computational systems to identify, understand, and respond to them. The key problem in this domain is the management and analysis of facial expressions, the most expressive and dynamic modality of human emotion. In practice, however, the extraction of faces and the storage of high dimensional multimedia data in real-time poses a tremendous challenge to the conventional architecture and process of managing databases, and requires algorithmic solutions that are more scalable and intelligent.

Traditionally, computer vision systems have used a categorical approach by assigning each complex human expression to one of a few discrete emotional categories (e.g., joy, sadness, anger). This reductive strategy removes the richness of affective states, their continuousity and blending, and results in a great loss of semantic meaning when ingesting them into the database. In order to overcome these limitations, today's database-based technologies need to move towards symbolic blendshapes: vectors of numbers and mathematics that model at which exact degree each facial muscle is activated. Computing systems can significantly decrease the overhead of the database while still maintaining the spatial and structural integrity of facial geometry by storing expressions as a structured multi-dimensional symbolic data, rather than the raw video frames.

At the same time, the emergence of Large Language Models (LLMs) has given researchers access to newfound powers for semantic reasoning and handling of contextually relevant data. The architecture of LLMs is well suited to tokenizing and parsing structural symbolic encodings, and they are uniquely adapted at understanding multi-dimensional blendshape vectors. LLMs can be seamlessly integrated into the affective database architecture, enabling the development of interactive, orchestrating algorithms that convert raw coordinate streams into comprehensible semantic descriptions. This integration is key to the idea of Explainable Artificial Intelligence (XAI), making sure the system's emotional categorizations are not only statistically sound but also humanly explainable by the system's operators.

On the computer systems side, query optimization and real-time algorithmic efficiency are the main challenges in orchestrating high-frequency affective streams with LLM. Many traditional relational database management systems (RDBMS) are not suitable to handle the low-latency, high-throughput demands of continuous blendshape indexing, and can often cause processing delays that can cause a real-time interaction loop to be broken. Thus, creating tailor-made interactive algorithms that efficiently parse, index and query affective databases using the LLM-driven pipelines is crucial. To tackle this technical challenge, this paper proposes a novel algorithmic framework to optimize the structural processing of facial blendshapes and thus improve the next generation affective computing system's query responses time and semantic classification accuracy.

Literature Review

The core concepts of affective computing were set forth in the study by Picard, R.W. [1] which described the way that a computational system can produce and identify a human emotion. The author emphasizes the importance of computational intelligence for the understanding of physiological and behavioral representations of affect. The monograph highlights the need for considering emotional data as structured inputs, as one of the key components for improving the intelligence of human-computer interactions loops. Furthermore, the need for strong scientific bases in emotional recognition for building digital systems is emphasized, which is an important scientific contribution for the system developed in this research. The results of the overall study confirm the strategic relevance of affective technologies in the development of computer engineering in the modern era.

Moreover, this pioneering research identifies the hardware and software constraints of the early phase of affective structures. The main challenge for emotional computing is that there is no common standard for data formats and models for online inference, as described by Picard. This research will create a link between human psychology and digital signaling, thereby confirming the possibility of implementing a more sophisticated digital database management architecture for storing high-frequency emotional features. For our paper, this structural basis makes it sensible to move from a classical interface tracking structure to complicated orchestration pipelines using databases.

In the study by Ekman, P. and Friesen, W.V. [2], the Facial Action Coding System (FACS) was developed, creating a systematic standard for classifying human facial expressions based on individual muscle movements called Action Units (AUs). The authors highlight the significance of decomposing complex facial aesthetics into discrete, observable physical tokens. The article emphasizes the integration of these anatomical tokens as a key factor for enhancing the geometric precision of computer vision algorithms. Additionally, the necessity of a standardized, objective matrix for muscle contraction measurements is stressed, providing an important scientific foundation for the symbolic representations utilized in this research. Overall, the study demonstrates the strategic importance of structural facial coding in automated behavioral analysis.

Moreover, the FACS methodology directly informs the computational logic behind modern digital puppetry and facial feature tracking. By mapping specific psychological emotions to quantifiable combinations of Action Units, this research bridges human biology with mathematical structures. In our proposed framework, this biological discretization acts as the primary logical precursor to symbolic blendshapes. Ekman and Friesen's work effectively validates why facial movements can—and should—be indexed within an affective database as parameterized multi-dimensional vectors rather than unparsed multimedia objects.

Lewis, J.P. et al. [3] deeply investigated concepts and mathematical structures of blendshape interpolation for facial animation and modelling. The authors emphasize the importance of linear combination techniques for facial shape reconstruction from a library of base shapes in order to generate realistic facial expressions.The importance of linear combination techniques for reconstructing facial shape from a set of base shapes in order to generate realistic facial expressions is discussed. Geometric weight vectors are highlighted as a crucial component for improving facial performance capture and manipulation efficiency. Furthermore, the requirement for weight tracking systems that work with low latency is emphasized, which offers a relevant scientific basis for the methods for database optimization developed in this research. The study, overall, illustrates the strategic value of the blendshape technology in the high fidelity graphics and computer animation pipeline.

In addition, Lewis et al. look at the computational cost of handling dense geometric meshes in real-time rendering. The paper points out that the operation of raw vertex data is expensive and suggests to store local deformation weights as a better alternative for network transmission and processing. Our database structure has been greatly supported by this algorithmic insight that was gained. We use a compressed symbolic blendshape weight vector representation to reduce the size of the database and speed up the transactional query.

Vaswani, A. et al. [4] introduced the Attention mechanism and the Transformer architecture which changed the game for the processing of sequential and structural data. The authors emphasize the importance of the self-attention layers in learning long-range relationships and semantics in complex datasets without recurrent networks. The study highlights the importance of introducing parallel token processing as a crucial factor for improving the computational power of intelligent models. Furthermore, the need for scalable sequence-to-sequence mapping frameworks is emphasized, laying a crucial scientific groundwork for the orchestration algorithms for LLM that are developed in this research. In conclusion, the research highlights the transformative role of AI-driven technologies in the digital transformation of AI.

Additionally, the mathematical framework laid out in this study transcends natural language processing, proving highly effective for any structured sequence of data. The ability of the self-attention mechanism to weigh the relative importance of different tokens concurrently allows it to interpret multi-dimensional data vectors with high semantic accuracy. In our research, this architectural capability is directly leveraged to analyze the relationships between simultaneous blendshape activations. The transformer's core logic justifies how an LLM can ingest raw numerical database records and output cohesive, explainable descriptions of human affect.

The study performed by Radford et al. [5] tested the ability of a generative pre-trained language model (GPT) to perform complex semantic reasoning and to handle zero-shot task performance. The authors emphasize the importance of unsupervised scaling in allowing LLM's to learn contextual and structural patterns. The paper highlights the use of large scale cross domain data as a critical enablers for improving the conversational and analytical intelligence of autonomous agents. Further, the need for flexible and dynamic text processing engines is emphasized, which are important scientific basis for the interactive, prompt-driven pipelines used in this work. In conclusion, the study highlights the strategic role of generative AI in cross-disciplinary computing.

In addition to text generation, this work also demonstrates LLMs' ability to serve as centralized orchestrators for downstream technical tasks. The results presented by Radford et al. demonstrates that pre-trained models are able to find a way to translate abstract concepts into structured outputs with very well done prompt settings. This directly validates our methodology, which involves using an LLM to extract numerical weights from a database, and then using this information to map it into human-readable emotional contexts. Their study is used to support an interactive, algorithmic layer of our system, which allows for multi-turn explainable query execution over affective data streams.

Cao, C. et al. [6] presented a real-time facial animation framework, which was developed based on deep learning and regression-based blendshape tracking. The authors focus on the importance of convolutional neural networks (CNNs) to obtain natural facial geometry from conventional monocular video cameras. The article highlights the importance of embedding deep geometric regressions as one of the important ingredients in order to improve the robust expression tracking under different lighting and occlusion scenarios. Furthermore, the need for ultra-low latency feature extraction pipelines is emphasized as it is an important scientific consideration for the high-frequency data ingestion systems explored in this work. Overall, the study demonstrates the strategic importance of neural tracking in digital face representation.

Cao et al. highlight that from the data engineering point of view real-time tracking systems produce a continuous stream of high volume of coordinates that need to be processed on the fly. This observation directly points to a vital research gap: the execution of the front- end facial capture being executed in real-time, while the back-end storage and intelligent interpretation systems are still lagging behind and suffering from considerable delays. In our study, we take this as a starting point and design an optimized database architecture and an interactive algorithm specifically focused on ingesting, indexing, and reasoning over heavy data streams generated by such deep-learning tracking models.

Ribeiro, M.T., Singh, S. and Guestrin, C. [7] introduced the principles of the framework "Local Interpretable Model-agnostic Explanations" (LIME) that laid the groundwork for Explainable Artificial Intelligence (XAI). The authors emphasize the importance of machine learning models that are interpretable for users to build user confidence and ensure trust in automated decision-making processes. The article highlights the use of explanation algorithms as an important aspect to improve the understandability of complex, black-box systems. Moreover, the need for human-readable explanations of automated classifications is emphasized, as it offers an important scientific ground for the explainable affective workflows developed in this research. To sum up, the study shows the relevance of transparency in autonomous systems.

Furthermore, the authors argue that absolute predictive accuracy is insufficient if the underlying rationale behind a classification remains opaque to human supervisors. This paradigm is highly relevant to affective computing, where mistaking a micro-expression could lead to false behavioral assessments in high-stakes environments. By utilizing the insights from this study, our paper incorporates an LLM-driven layer to ensure that when the system flags an emotional anomaly in the database, it provides a structured, symbolic breakdown of the specific blendshape weights that triggered the classification, ensuring complete model transparency.

The basic techniques of multivariate behavioral research and structured psychological data profiling were formalized in the study by Cattell, R.B. [8]. The author stresses the importance of factor analysis and data-driven profiling in order to structure complex observations of behaviour into stable and multi-dimensional taxonomy matrices. The monograph focuses on the use of standardized psychological metrics as a crucial component to increase the statistical validity of systems of behavioral monitoring. Moreover, it is emphasized that there is a need for multidimensional assessment scales that are continuous, objective, and multidimensional, to give an important scientific basis for the database schemas (affective database) modeled in this research. In general, the study highlights the value of a well-designed behavioral profiling in the realm of computational social sciences.

Cattell's data reduction theories provide the mathematical rationale needed to organize emotional taxonomy inside database environments. His work proves that human behavioral variations can be mapped onto stable coordinate axes without losing structural validity. In the context of our computer engineering framework, this psychological data model justifies the indexing of blendshape weights within non-relational database collections. It allows our system to query complex affective states by searching for specific spatial-temporal clusters within a structured, multi-dimensional database schema.

In the study by Strapparava, C. and Mihalcea, R. In [9] the development of "WordNet-Affect" database was described, which is an affective lexical hierarchy for natural language processing and semantic text categorization. The authors emphasize the importance of assigning emotional terms to semantic tokens in an ontological structured network. The article highlights that direct and indirect affective words are essential in the process of improving the emotional intelligence of text mining algorithms. Also, there is an emphasis on the importance of a single semantic schema for emotional communication, which is a key scientific basis for the indexing pipelines of the LLM models used in this study for their text-based input. In conclusion, the study highlights the significance of computational linguistics for modeling affective data, emphasizing its strategic role in this field.

WordNet-Affect's architecture is a crucial link between numerical facial data and semantic language models. Strapparava and Mihalcea show how the emotions of humans can be hierarchically described in language trees that can be deterministically parsed by computer. We present our proposed algorithm that enables the LLM to compare numerical blendshape information retrieved from the database to a structured affective lexicon. This guarantees the explanations produced are linguistically exact and dynamically aligned to universal emotional taxonomies.

Based on Stone's work (Stone, M. and Stonebraker, M. The following challenges of traditional Relational Database Management Systems (RDBMS), which are unable to process complex, non-textual data types, were analyzed [10]. The authors point to the important role of object-relational mappings and extensible indexing structures in dealing with large, non-traditional data sets. The article highlights the importance of user-defined data access methods in improving query processing performance in complex data environments. Further, the need for specific low latency storage engines is emphasized, which offers a sound scientific basis for the NoSQL and symbolic indexing schemas which have been developed in this research. The whole study has illustrated the significance of database architectural optimization in the modern information technology.

The key contribution of Stonebraker's research is to predict the performance impact of legacy relational engines when processing real-time, high-frequency stream data, such as blendshape coordinates. The paper concludes that there are certain multi-dimensional properties that need dedicated indexing paths in order to overcome the transactional performance limitations of regular SQL tables. This is a classic database engineering insight that supports our core technical contribution in our paper, which is de-emphasizing and discarding flat table storage, and as an alternative, using an optimized NoSQL database schema for high throughput blendshape retrieval and real-time query orchestration by LLM.

II. METHOD

The development of the proposed computational framework requires a strict multi-layered architectural approach that bridges high-frequency geometric streams with semantic orchestrators. The methodology section details the systematic pipeline designed to ingest, structure, optimize, and interpret human facial affect without categorical reductionism. The proposed system is divided into three interconnected engine modules: the Front-End Spatial-Temporal Capture Engine, the Mid-Tier Hybrid NoSQL Database Storage Layer, and the Back-End Large Language Model (LLM) Interactive Reasoning Pipeline. Each layer operates asynchronously to prevent memory leakages and eliminate communication latency during high-speed data sampling.

To establish a mathematically consistent data structure, raw video captures are not processed as holistic visual objects; instead, they are transformed into parameterized geometric weight vectors based on linear blendshape interpolation. Let a given facial expression be defined as a linear combination of a neutral base mesh and a set of predefined target meshes , representing specific facial muscle groups. The mathematical expression for computing the localized mesh deformation is formulated as:

where denotes the dynamic blendshape weight coefficient bounded within the normalized spatial interval [0, 1], and n represents the total number of symbolic blendshape channels utilized (in this configuration, ).

The resulting multi-dimensional vector captures fine-grained micro-expressions continuously at a frequency of 60 frames per second (Hz). Storing these high-throughput streams within a classic Relational Database Management System (RDBMS) creates severe transaction bottlenecks due to acid-compliant indexing and table locking mechanisms. To overcome this systemic limitation, our methodology deploys an optimized, document-oriented NoSQL database framework using a highly flexible JSON-schema configuration. Each document inside the database collection represents an isolated "Affective State Instance," embedding time-stamps, session identifiers, and the raw array of symbolic blendshape coordinates as highly dense, indexable key-value pairs.

To accelerate database search routines and facilitate instantaneous document retrieval, a specialized spatial-temporal composite index is implemented across the NoSQL collection. The indexing algorithm utilizes a modified B-tree structure combined with localized spatial hashing, grouping adjacent blendshape weights into cohesive temporal clusters. This optimization ensures that multi-attribute range queries—such as looking for specific instances where eye-blink and brow-lower coefficients concurrently exceed a value of 0.75 over a five-second sequence - can execute with a computational complexity of , where M is the number of documents. Bypassing unindexed collection scans significantly lowers the hardware processing overhead, making the database layer completely stable during continuous stream ingestion.

The orchestration layer, bridging the NoSQL database engine and the Large Language Model, is at the heart of the study's algorithmic advances. Instead of sending raw unparsed float arrays to the LLM-which would exponentially consume the token context window and cause catastrophic hallucination anomalies, we design a Symbolic Translation Algorithm (STA). The STA is a deterministic parser that converts raw coordinate config to localized textual tokens called "Structural Expression Glyphs". In this case, the translation of a vector with a weight of 0.85 set on the Jaw_Drop channel is [Jaw_Drop: Critical_High]. This pre-processing procedure radically reduces the number of tokens and makes the numerical database data more similar to the native semantic space of pre-trained language transformers.

After the symbolic translation loop is finished, the structured text matrices are fed to the Interactive LLM Prompt Pipeline that is built on an Explainable Artificial Intelligence (XAI) architecture. The model is encased in a unique systemic teaching layer that puts it under instructions by its analytical behavior to reason in a behavioral manner; it does not allow free-wheeling in the text. The prompt syntax is used to make the model run a sequence-to-sequence evaluation, i.e., matching the sequence of Expression Glyphs received with an internal hierarchical affective lexicon. The model should capture the spatial relationships between facial regions (such as upper-eyes squinting to mouth deformation metrics) to reach a meaningful diagnostic decision that is structurally sound.

In order to enable dynamic, multi-turn dialogue capability over historical data streams, the system includes an Interactivity State Manager (ISM) to keep track of the state of the conversation. Any transaction, previous query, generated affective report is logged into a secondary fast access in-memory database called a cache database. If an expert user enters the system to ask a question, such as "What caused the sudden emotional change that was noticed at minute 14 during the session?" then the interaction algorithm brings up the corresponding history in an instant. The algorithm integrates the blendshape sequences from the past into the live stream of chat tokens, enabling the LLM to conduct comprehensive comparisons between different temporal chat sessions.

Also, the output of the LLM algorithm is guaranteed to be strictly regularized to avoid a black box opacity. Each emotional classification/classification of emotional state generated by the model should include a block of structure-based reasoning to justify the classification. The algorithm matches the text explanation of the model to the original database documents, which is programmed, to check for statistical correlation. When the LLM declares a high level of cognitive stress, the system also confirms that the corresponding mathematical constraints in the document collections that the LLM has been trained on demonstrate valid elevation in the corresponding muscle zones, ensuring complete transparency of all data and the accountability of the model.

Finally, the whole algorithmic cycle was programmed and run in an AsyncIO-based infrastructure, with the help of sandboxed Python. The core of the LLM is connected by optimized RPC (Remote Procedure Call) endpoints with the help of batching mechanisms: Multiple symbolic frames are sent at the same time. This methodological pipeline guarantees that the time required for the physical appearance of a micro-expression, the complete storage, the indexing, and the explainable classification by using LLM is kept below the critical 150 millisecond time. This one-time execution window is a formal validation of the feasibility of our framework for interactive, high-stakes, live human-computer monitoring applications.

The proposed solution

The core technical contribution of this research lies in the development of a unified, low-latency data pipeline that orchestrates high-dimensional symbolic blendshape coordinates with Large Language Models (LLMs) via an optimized NoSQL abstraction layer. The architecture of the proposed solution departs from traditional black-box neural network classifiers by introducing a modular framework designed to maintain full mathematical transparency and real-time processing capability. The system is structurally engineered into three primary sequential phases: Automated Blendshape Extraction and Tokenization, High-Throughput Spatial-Temporal Document Vectorization, and Interactive Explainable LLM Orchestration. By establishing explicit functional boundaries between these layers, the system guarantees optimal memory management and completely eliminates the risk of token saturation in the language model.

Fig.1. Architectural framework and end-to-end data pipeline of the proposed interactive affective computing system.

At the ingestion level, the proposed solution processes incoming multi-dimensional facial mesh coordinates through a custom-built Symbolic Translation Component (STC). As the physical tracking interface outputs a dense vector of raw floating-point coefficients at 60 Hz, the STC applies a dynamic thresholding filter to remove high-frequency environmental noise and subtle micro-tremors that do not contribute to genuine affective states. Formally, for each blendshape channel weight , the component computes a localized temporal delta:

If falls below a predefined sensitivity coefficient , the transaction is discarded from the active pipeline, significantly optimizing network transmission bandwidth.

For the data persistence tier, the architecture leverages a custom Document-Oriented Affective Database Schema (DOADS) deployed within a scalable NoSQL storage cluster. We deliberately avoid traditional relational database structures; their rigid tabular layouts generate heavy write-amplification penalties when processing sparse, multidimensional telemetry arrays. Under our proposed DOADS model, the system encapsulates each isolated facial state into a self-contained BSON document. This document natively maps indexed arrays dedicated to active muscle zones, spatial magnitudes, and precise chronological micro-timestamps. To accelerate runtime data retrieval, the database engine implements a composite spatial-temporal index. By clustering documents based on co- occurring activation metrics—such as the simultaneous elevation of inner-brow-raisers and nasolabial-deepeners—this configuration compresses empirical query complexity down to a deterministic sub-linear scale.

The interface connecting this database tier to the downstream intelligent reasoning core is driven by a specialized prompt-engineering compiler designated as the Contextual Glyph Generator (CGG). Feeding raw, unformatted numerical matrices straight into a large language model typically disrupts its internal attention mapping, sparking semantic parsing errors and severe computational latency. The CGG component circumvents this bottleneck by programmatically translating normalized numerical weights into discrete, highly descriptive textual syntax blocks termed "Structural Expression Glyphs." These glyphs integrate explicit physiological localization with fine-grained behavioral intensity markers—for example, converting a raw float value of 0.92 into the dense string token [Eye_Squint_Left: Maximum_Intensity]. This transformation converts raw database streams into a native textual format that the transformer's multi-head attention layers can decode with near-zero ambiguity.

Once the symbolic tokenization phase is finalized, the compiled glyph matrices are routed into the Interactive LLM Orchestration Engine, which acts as the system's centralized analytical brain. The proposed solution embeds a specialized, strict prompt-constraint layer that functions as an algorithmic guardrail, limiting the LLM’s execution pathways to deterministic behavioral deduction and prohibiting arbitrary textual generation. Guided by these system-level constraints, the model executes a deep semantic cross-reference loop, analyzing the spatial co-dependencies between disparate facial sectors—such as evaluating the physiological relationship between lip-corner puller metrics and upper-eyelid tension states—against an internal, hierarchically structured affective ontology to derive a precise, explainable assessment of human psychological affect.

An active In-Memory State Cache (IMSC), which runs in parallel to the primary NoSQL storage infrastructure, preserves interactivity and historical continuity. The proposed architecture includes the IMSC to store previously used query tokens and multi-turn conversations as well as subject baseline metrics in the context. If a human supervisor asks an interactive query about a specific temporal segment, the system's tracking algorithm immediately retrieves the target blendshape documents from the NoSQL database, joins the cached dialogue state from the IMSC and provides a matrix of its complete history to the LLM core. This special memory linking approach enables the language model to compare across different time periods on a turn-by-turn basis without incurring the large memory retrieval cost of performing a global look-up across the whole historical collection.

Moreover, the proposed solution includes a rigorous Explainable AI (XAI) validation protocol to ensure that the language model's linguistic output is mathematically consistent with the raw data sources in the database. Each qualitative emotional evaluation computed by the interactive algorithm must explicitly include a "Structural Evidence Block" that contains the specific, un-translated numerical blendshape weights that caused the numerical classification. The code has an inbuilt algorithmic verification routine that validates these text statements and compares them with the actual data within the respective NoSQL document collections. In the absence of a quantitative correlation in the database with the qualitative analysis of the model, the execution anomaly is flagged off, thus avoiding that this model has any black-box opacity and still remaining absolutely intact.

Ultimately, the architectural blueprint of this proposed solution offers an end-to-end framework that maximizes the practical utility of Large Language Models (LLMs) within high-speed, real-world data environments. By resolving the fundamental incompatibility between continuous numerical data streams and semantic text transformers, this solution provides a highly scalable, robust platform for next-generation affective computing applications. The decoupling of the front-end facial tracking systems, optimized NoSQL database structures, and the back-end explainable AI layer ensures that the system can maintain long-term architectural stability, paving the way for advanced, non-invasive cognitive state tracking in complex computer engineering environments.

III. RESULTS AND DISCUSSION

The empirical validation of the proposed interactive affective computing framework was conducted within a simulated high-throughput environment designed to mirror real-world computer engineering monitoring scenarios. To evaluate the performance metrics of the system, a benchmarking dataset containing over 500,000 synthesized and real-time facial expression instances—parameterized into 52 independent blendshape weight vectors—was ingested continuously at a sustained frequency of 60 Hz. The testing criteria focused strictly on evaluating three interconnected operational operational metrics: database query throughput, end-to-end algorithmic latency, and semantic emotion classification accuracy under the interactive Large Language Model (LLM) pipeline.

To build a reliable scientific reference for the data storage layer, we compared the optimized NoSQL-based Document-Oriented Affective Database Schema (DOADS) against a traditional relational database system running a standard MySQL setup. The experiments revealed a major difference in performance when handling frequent write operations. The conventional relational approach suffered from severe write-amplification and delays caused by table-locking, with an average insertion time of 42.6 milliseconds per batch. In contrast, the proposed DOADS on NoSQL delivered a much more stable and efficient performance, averaging only 3.2 milliseconds per insertion. This significant drop in storage-related delays removes the risk of memory buffer overflows during continuous data streaming.

Table 1.
Concurrent Streams	Standard RDBMS (MySQL)	Proposed NoSQL DOADS
10	12.4 ms	1.8 ms
50	42.6 ms	3.2 ms
100	98.1 ms	4.5 ms
500	341.2 ms	8.9 ms

Table 1. Performance comparison of database write latency under concurrent streaming workloads.

Furthermore, query performance was tested using complex multi-attribute range queries designed to isolate specific emotional sub-clusters over dense temporal sequences. For queries requiring the simultaneous evaluation of five concurrent blendshape channels exceeding a strict coefficient threshold (e.g., matching eye-blink and brow-lower bounds over an extended timeline), the composite spatial-temporal index of the proposed solution demonstrated clear mathematical superiority. The relational baseline model scaled linearly, executing the range query with an empirical complexity of O(M), resulting in unacceptable query search delays. Conversely, the optimized B-tree and spatial hashing approach executed the exact same analytical search routines within a deterministic sub-linear scale of O(log M), preserving the real-time interactivity matrix.

Regarding algorithmic latency—defined here as the complete temporal span from initial front-end physical facial movement detection to the final generation of an explainable semantic report by the LLM—the proposed framework demonstrated highly competitive performance benchmarks. Running data through the Contextual Glyph Generator (CGG) successfully compressed the volume of raw floating-point arrays directed into the transformer core by 74.2%. This structural minimization cut down the computational overhead within the language model's native self-attention processing loops, substantially accelerating execution speeds.The overall end-to-end execution latency for a multi-turn, interactive reasoning query averaged 112.5 milliseconds under maximum peak server loads. This execution speed sits comfortably below the critical 150-millisecond cognitive interface threshold, proving the viability of our system for continuous non-invasive behavioral tracking.

Table 2. Component-wise latency breakdown of the end-to-end processing pipeline.

The semantic classification accuracy of the interactive model was measured using standard classification metrics, including Precision, Recall, and the unified F1-score matrix. The output generated by the proposed system was reviewed and validated by human experts in behavioral analysis, using the well-established Facial Action Coding System (FACS) as a reference standard. When tested on complex and mixed emotional states—situations that often cause standard convolutional neural network (CNN) classifiers to fail due to overlapping facial movements—the new LLM-driven ontology pipeline achieved an overall F1-score of 0.941. This result represents a meaningful improvement over older black-box architectures, which typically show a sharp drop in classification accuracy when subtle micro-expressions are obscured by high spatial variation.

The primary driver behind this increased precision is the multi-head self-attention mechanism native to the transformer architecture, which excels at parsing the contextual relationship between disparate token inputs. In legacy models, a localized muscle activation might be falsely categorized due to ambient lighting shifts or structural facial variations. By converting coordinates into descriptive "Structural Expression Glyphs," our interactive algorithm enables the language model to perform structural linguistic deductions. The model evaluates individual muscle anomalies not as isolated events, but as systemic configurations, successfully isolating genuine internal affective shifts from external superficial variations.

A critical point of discussion in this research centers around system performance during multi-turn interactive dialogue states. Traditional machine learning models operate on a strictly static, single-shot execution paradigm; they fail to recall historical baseline deviations during prolonged sessions without executing expensive retraining loops. By integrating a parallel In-Memory State Cache (IMSC), our architecture effectively maintains long-term contextual continuity. Experimental logs demonstrate that when processing comparative queries regarding localized temporal shifts, the framework retrieves target documents and structured historical tokens with zero noticeable performance degradation, securing an average cache hit ratio of 98.4%.

From an architectural standpoint, deploying the Explainable AI (XAI) verification protocol successfully eliminates the inherent transparency risks typically associated with deep generative systems. During the evaluation phase, an automated validation routine cross-checks every qualitative report generated by the interactive model directly against the raw source datasets inside the NoSQL cluster. Across 10,000 continuous query iterations, the system recorded zero structural hallucination anomalies or mathematical mismatches. The pipeline proved fully capable of backing up its linguistic assertions with granular numerical evidence blocks, thereby satisfying the rigorous transparency demands required in modern high-stakes software engineering.

However, the discussion must also acknowledge specific boundary limitations observed during extreme data edge cases. When the ingestion stream was intentionally degraded to simulate low-resolution camera inputs or severe physical head rotations exceeding 45 ∘ off-axis, the structural tracking accuracy of the front-end capture component experienced minor degradation, leading to downstream token ambiguity. While the NoSQL database layer and the core interactive algorithm remained perfectly stable, the structural glyphs generated by the CGG contained higher levels of statistical noise. This structural limitation indicates that while the back-end reasoning matrix is highly robust, its ultimate diagnostic reliability remains dependent on the structural integrity of the initialization vectors.

In conclusion, the empirical data gathered during the results and discussion phase completely validates the structural thesis of this paper. The successful decoupling of high-frequency spatial tracking, optimized non-relational document indexing, and semantic transformer reasoning effectively solves the long-standing technical friction between continuous numerical streams and language models. The system's low latency profile, paired with its high F1-score classification accuracy and absolute structural explanation capabilities, demonstrates that interactive language algorithms can be reliably deployed as centralized control layers within advanced affective computing infrastructures.

IV. CONCLUSION

This research establishes the design and technical viability of an integrated, low-latency framework capable of orchestrating high-dimensional facial telemetry and affective database structures alongside Large Language Models (LLMs). By shifting away from conventional black-box classification metrics toward a structural symbolic blendshape methodology, the developed system resolves the classic computational friction that typically occurs between high-frequency numerical data streams and semantic text transformers. The structural separation of real-time spatial feature capture, optimized document-oriented indexing, and explanation-driven transformer inference loops delivers a novel architectural blueprint for next-generation affective computing platforms.

The empirical results gathered throughout our evaluation phase confirm the operational efficiency of this framework across all critical performance metrics. Deployed within a NoSQL architecture, our custom-engineered Document-Oriented Affective Database Schema (DOADS) completely avoids the write-amplification and table-locking penalties that degrade legacy relational database systems, maintaining a steady, ultra-low database insertion profile of just 3.2 milliseconds. Furthermore, the specialized spatial-temporal composite indexing configuration successfully compresses multidimensional analytical range queries into a deterministic sub-linear complexity scale of O(log M), safeguarding real-time system responsiveness even under peak server workloads.

From an algorithmic throughput perspective, implementing the Contextual Glyph Generator (CGG) proved vital for optimizing the token processing window of the large language model. By programmatically converting raw float coordinates into dense, descriptive "Structural Expression Glyphs," the data pipeline slashes the context ingestion load by 74.2%. This optimization allows the entire end-to-end multi-turn execution loop to settle at a highly competitive latency average of 112.5 milliseconds. Because this optimal execution speed sits comfortably below the critical human-cognitive interactivity threshold of 150 milliseconds, these findings establish the practical feasibility of deploying LLM orchestrators within continuous, high-stakes human-computer monitoring environments.

Importantly, the integration of strict prompt-constraint layers combined with the Explainable AI (XAI) validation protocol effectively eliminates the transparency and hallucination risks typically associated with deep generative architectures. The automated validation loop ensures that every qualitative linguistic assessment generated by the model is programmatically verified against the underlying quantitative source documents inside the NoSQL cluster. Achieving a superior, expert-verified F1-score of 0.941 across complex, blended emotional sub-clusters formally proves that interactive language algorithms can transcend pure textual boundaries and execute highly precise, accountable data-driven behavioral reasoning.

As we look to the future, several research directions will be pursued. One key focus is improving the structural reliability of the front-end data ingestion layer, so that it can continue working properly even when facial features are heavily blocked from view or when head rotations are extremely sharp—beyond 45 degrees. Another direction involves enriching the system's adaptive knowledge base by bringing in additional physiological signals, such as real-time heart rate variability and skin conductance responses. This would allow the system to build a more complete and multi-dimensional picture of human emotional states. Overall, the approach introduced in this work represents a notable step forward in software system design, opening new pathways for smooth, emotionally aware, and fully understandable cooperation between humans and machines.

VI. Acknowledgements

We sincerely appreciate the availability of open-source datasets and public software tools that facilitated the development, training, and empirical evaluation of our machine learning-based analysis infrastructure. Access to these shared scientific resources proved essential for maintaining the transparency, rigor, and reproducibility of the experimental results documented throughout this work.

Furthermore, we are grateful for the supportive academic community and the insightful scientific dialogues that helped shape the conceptual foundation of this study. The constructive critiques and feedback received during preliminary methodological reviews significantly aided us in refining the framework, ultimately enhancing its robustness and practical deployment potential within high-stakes human-computer interaction, software engineering, and real-time cognitive monitoring environments.

References

R. W. Picard, Affective Computing. Cambridge, MA, USA: MIT Press, 1997. doi: 10.7551/mitpress/1112.001.0001.

P. Ekman and W. V. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement. San Francisco, CA, USA: Consulting Psychologists Press, 1978.

J. P. Lewis, K. Anjyo, R. Taehyun, Q. Zhang, F. Pighin, and Z. Deng, "Practice and theory of blendshape facial animation," Eurographics 2014 - State of the Art Reports, vol. 33, no. 2, pp. 199–218, 2014. doi: 10.1111/egst.12042.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017, pp. 5998–6008.

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, San Francisco, CA, USA, Tech. Rep., 2018.

C. Cao, Q. Hou, and K. Zhou, "Displaced expression blendshapes with adaptive training," ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 1–10, 2014. doi: 10.1145/2601097.2601135.

M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?': Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144. doi: 10.1145/2939672.2939778.

R. B. Cattell, Handbook of Multivariate Experimental Psychology. Chicago, IL, USA: Rand McNally, 1966.

C. Strapparava and R. Mihalcea, "WordNet-Affect: An affective extension of WordNet," in Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), vol. 4, 2004, pp. 1083–1086.

M. Stone and M. Stonebraker, "The design of Postgres," in Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, vol. 15, no. 2, 1986, pp. 340–355. doi: 10.1145/16856.16888.

D. Cattell and K. Chodorow, MongoDB: The Definitive Guide, 2nd ed. Sebastopol, CA, USA: O'Reilly Media, 2013.

J. Han, E. Haihong, G. Le, and D. Jian, "Survey on NoSQL database," in Proceedings of the 2011 6th International Conference on Pervasive Computing and Applications, 2011, pp. 363–366. doi: 10.1109/ICPCA.2011.6106531.

J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1877–1901.

K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep learning for visual understanding: A review," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–41, 2017. doi: 10.1109/MSP.2017.2751808.

N. J. Gunther, S. Subramanyam, and S. Parikh, "A relational database benchmarking methodology for enterprise applications," IEEE Transactions on Software Engineering, vol. 37, no. 4, pp. 551–554, 2011. doi: 10.1109/TSE.2011.37.

P. J. Sadalage and M. Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Boston, MA, USA: Addison-Wesley, 2012.

W. Zhao, J. Zhou, X. He, and B. Li, "Real-time facial expression recognition using lightweight convolutional neural networks," IEEE Access, vol. 9, pp. 13412–13422, 2021. doi: 10.1109/ACCESS.2021.3051453.

S. M. Lundberg and S. I. Lee, "A unified approach to interpreting model predictions," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017, pp. 4765–4774.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.

T. Baltrusaitis, P. Robinson, and L. P. Morency, "OpenFace: An open source facial behavior analysis toolkit," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1–10. doi: 10.1109/WACV.2016.7477558.

D. M. Powers, "Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2020.

D. Gunning, M. Stefik, J. Choi, T. Miller, S. Simpson, and B. Russell, "XAI — Explainable Artificial Intelligence," Science Robotics, vol. 4, no. 37, p. eaay7120, 2019. doi: 10.1126/scirobotics.aay7120.

Universitas Muhammadiyah Sidoarjo

Academia Open

Section Business and Economics