AI-Based Longitudinal Tracking for Early Prediction of Mental Health Decline with Predictive Alerts

Institution: Prince Sultan University, Riyadh, Saudi Arabia

College: Computer and Information Sciences

Layan Alnasser Researcher
Hessah Alsubaiee Researcher
Leen Almunifi Researcher
Norah Alhadyani Researcher
Mashael Alsedais Researcher
Supervised by:
Dr. Souad Larabi-Marie-Sainte

Abstract

Tracking mental health through automated text analysis is an efficient screening solution because it is scalable and accessible. This matters especially for Arabic, a language that remains underrepresented in this area of research. This study proposes a dual-model machine learning system that assesses depression and anxiety severity from Arabic text.

Keywords: Mental Health, Depression, Anxiety, Arabic NLP, Machine Learning, Text Classification, Automated Screening

1. Research Overview

Key Innovation

This research addresses a critical gap in mental health monitoring by developing an AI-based longitudinal tracking system that evaluates behavioral and linguistic indicators over time, identifies early indicators of decline, and promptly notifies users when alarming patterns appear.

The Problem

Most existing mental health research relies on static, cross-sectional data collected at a single point in time, which limits its ability to capture gradual or sudden changes in anxiety and depression levels. As a result, diagnosis is often delayed until individuals reach a critical stage.

The Solution

An AI-based longitudinal monitoring system that shifts mental health analysis from static classification to trajectory-based evaluation, using high-fidelity, AI-generated synthetic longitudinal data designed to replicate realistic psychological patterns.

3. Methodology

Dataset

  • 2,500 Arabic text entries
  • 100 individuals tracked
  • 25 entries per individual
  • 70/30 train/test split (1,750 training and 750 test entries)

The dataset comprises synthetically generated Arabic text entries created through AI-assisted generation (Claude). Each entry is independently annotated with depression and anxiety severity scores on a four-point ordinal scale:

  • 0: None
  • 1: Mild
  • 2: Moderate
  • 3: Severe
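The dataset can be pictured as one record per entry, each carrying both severity labels. The Python sketch below rests on stated assumptions: the `Entry` fields, `split_entries`, and the seed are hypothetical names chosen for illustration, the texts and scores are placeholders rather than the study's data, and the split is done at the entry level. That split choice matches the 750-entry test support reported in the Results, but the document does not confirm it.

```python
import random
from dataclasses import dataclass

# Four-point ordinal severity scale shared by the depression and anxiety labels.
SEVERITY_LABELS = {0: "None", 1: "Mild", 2: "Moderate", 3: "Severe"}


@dataclass
class Entry:
    """Hypothetical record layout for one longitudinal text entry."""
    individual_id: int  # which tracked individual wrote the entry
    entry_index: int    # position in that individual's timeline (0-based)
    text: str           # Arabic text of the entry
    depression: int     # severity score in {0, 1, 2, 3}
    anxiety: int        # severity score in {0, 1, 2, 3}

    def __post_init__(self):
        for score in (self.depression, self.anxiety):
            if score not in SEVERITY_LABELS:
                raise ValueError(f"score must be 0-3, got {score}")


def split_entries(entries, test_fraction=0.3, seed=42):
    """Assumption: shuffle and split at the entry level into (train, test)."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder corpus with the study's shape: 100 individuals x 25 entries.
entries = [
    Entry(i, t, text="<placeholder>", depression=0, anxiety=0)
    for i in range(100)
    for t in range(25)
]
train, test = split_entries(entries)
print(len(train), len(test))  # 1750 750
```

Because each individual contributes 25 entries, an entry-level split can place entries from the same person in both sets; splitting by `individual_id` instead would keep test individuals unseen during training.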

Technical Architecture

Feature Engineering

EmbeddingGemma-300M Model (Google)

  • 768-dimensional output embedding
  • Multilingual support for 100+ languages including Arabic
  • Transforms natural language into numerical vector representations
  • Captures semantic and contextual information

Classification Models

The system implements a dual-model architecture:

  • Depression Severity Prediction Model
  • Anxiety Severity Prediction Model

Both models use a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel, which can learn non-linear class boundaries in the embedding space. Such boundaries are motivated by the morphological richness and lexical sparsity of Arabic text.
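The dual-model design amounts to fitting two independent RBF-kernel SVM classifiers on the same embedding matrix. The sketch below assumes scikit-learn's `SVC`. The hyperparameters (`C=1.0`, `gamma="scale"`) are library defaults rather than values reported in this study, and `probability=True` is an added assumption so that per-class scores are available for ROC analysis. The EmbeddingGemma step is replaced with random 768-dimensional vectors so the example runs without downloading the model. The helper names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

EMBEDDING_DIM = 768  # output size of EmbeddingGemma-300M


def build_severity_model():
    """One RBF-kernel SVM for a single target (depression or anxiety).

    Hyperparameters are scikit-learn defaults, not values from the study;
    probability=True enables predict_proba for ROC/AUC analysis.
    """
    return SVC(kernel="rbf", C=1.0, gamma="scale", probability=True, random_state=0)


def train_dual_models(X, y_depression, y_anxiety):
    """Fit two independent classifiers on the same embedding matrix X."""
    return {
        "depression": build_severity_model().fit(X, y_depression),
        "anxiety": build_severity_model().fit(X, y_anxiety),
    }


def predict_scores(models, X):
    """Return one predicted 0-3 severity score per entry for each condition."""
    return {name: model.predict(X) for name, model in models.items()}


# Stand-in for EmbeddingGemma output: four well-separated clusters of
# 768-dimensional vectors (placeholder data, not the study's embeddings).
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, EMBEDDING_DIM))
labels = rng.integers(0, 4, size=200)
X = centers[labels] + 0.1 * rng.normal(size=(200, EMBEDDING_DIM))
y_depression = labels
y_anxiety = (labels + 1) % 4  # hypothetical second labelling

models = train_dual_models(X, y_depression, y_anxiety)
preds = predict_scores(models, X[:5])
```

Keeping the two classifiers separate mirrors the dual-model architecture described above; it also means each model is tuned and evaluated on its own target.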

4. Results & Performance

Depression Score Prediction Model

  • Accuracy: 99.87%
  • Precision: 1.00
  • Recall: 1.00
  • F1-Score: 1.00

Class | Precision | Recall | F1-Score | Support
0 (None) | 1.00 | 1.00 | 1.00 | 256
1 (Mild) | 1.00 | 1.00 | 1.00 | 203
2 (Moderate) | 1.00 | 1.00 | 1.00 | 166
3 (Severe) | 0.99 | 1.00 | 1.00 | 125

Anxiety Score Prediction Model

  • Accuracy: 98.93%
  • Precision: 0.99
  • Recall: 0.99
  • F1-Score: 0.99

Class | Precision | Recall | F1-Score | Support
0 (None) | 1.00 | 1.00 | 1.00 | 260
1 (Mild) | 1.00 | 1.00 | 1.00 | 204
2 (Moderate) | 0.97 | 0.98 | 0.98 | 170
3 (Severe) | 0.97 | 0.96 | 0.97 | 116
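The per-class precision, recall, and F1-score in these tables follow directly from a confusion matrix. The helper below is a minimal sketch, not the study's evaluation code. `per_class_metrics` is a hypothetical name and the 4x4 matrix is a toy example. The same arithmetic links the headline figures to the test support: an accuracy of 99.87% on 750 test entries corresponds to 749 correct predictions.

```python
def per_class_metrics(confusion):
    """Accuracy and per-class precision/recall/F1 from a square confusion matrix.

    confusion[i][j] counts entries whose true class is i and predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    report = {}
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        support = sum(confusion[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[k] = {"precision": precision, "recall": recall, "f1": f1, "support": support}
    return accuracy, report


# Toy matrix (hypothetical counts): rows = true class 0-3, columns = predicted class 0-3.
toy = [
    [10, 0, 0, 0],
    [0, 9, 1, 0],
    [0, 1, 8, 1],
    [0, 0, 1, 9],
]
accuracy, report = per_class_metrics(toy)
print(f"accuracy = {accuracy:.2%}")  # accuracy = 90.00%
print(round(report[2]["precision"], 2), round(report[2]["recall"], 2))  # 0.8 0.8
print(f"{749 / 750:.2%}")  # 99.87%
```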

ROC Curves & AUC Scores

Depression Model AUC

Class | AUC Score
0 (None) | 1.00
1 (Mild) | 1.00
2 (Moderate) | 1.00
3 (Severe) | 0.99
Macro-Average | 1.00
Weighted-Average | 1.00

Anxiety Model AUC

Class | AUC Score
0 (None) | 1.00
1 (Mild) | 1.00
2 (Moderate) | 0.98
3 (Severe) | 0.97
Macro-Average | 0.99
Weighted-Average | 0.99
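The one-vs-rest AUCs and their macro and weighted averages can be computed from each model's per-class probability scores. This sketch assumes the reported averages follow the standard one-vs-rest definitions (macro = unweighted mean over classes, weighted = mean weighted by class support), which the document does not state. `binary_auc` and `one_vs_rest_auc` are hypothetical helpers, and the probabilities are toy values. The code uses the rank (Mann-Whitney) formulation of AUC: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. scikit-learn's `roc_auc_score` with `multi_class="ovr"` computes the same quantities.

```python
def binary_auc(scores, is_positive):
    """AUC = P(random positive scored above random negative); ties count 1/2."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # O(len(pos) * len(neg)) pairwise comparison; fine for a few hundred entries.
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0 for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs, y_true, n_classes=4):
    """Per-class one-vs-rest AUCs plus macro- and support-weighted averages."""
    per_class = {
        k: binary_auc([row[k] for row in probs], [y == k for y in y_true])
        for k in range(n_classes)
    }
    support = {k: sum(1 for y in y_true if y == k) for k in range(n_classes)}
    macro = sum(per_class.values()) / n_classes
    weighted = sum(per_class[k] * support[k] for k in range(n_classes)) / len(y_true)
    return per_class, macro, weighted


# Toy scores (hypothetical): columns are P(none), P(mild), P(moderate), P(severe).
y_true = [0, 1, 2, 3, 2, 3]
probs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.6, 0.3],
    [0.0, 0.0, 0.3, 0.7],
    [0.0, 0.0, 0.4, 0.6],  # a moderate entry scored closer to severe
    [0.0, 0.0, 0.5, 0.5],
]
per_class, macro, weighted = one_vs_rest_auc(probs, y_true)
print(per_class)  # {0: 1.0, 1: 1.0, 2: 0.875, 3: 0.875}
print(macro)      # 0.9375
```

As in the anxiety results, overlap between the moderate and severe scores lowers only those two classes' AUCs.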

AUC Interpretation

  • Depression Model: Achieves near-perfect discrimination, with AUC = 1.00 for the none, mild, and moderate classes and 0.99 for the severe class
  • Anxiety Model: Maintains highly effective discrimination (AUC ≥ 0.97) across all classes
  • Slightly lower AUCs for moderate (0.98) and severe (0.97) anxiety suggest more overlap between these classes in the feature space
  • Both models demonstrate excellent ability to distinguish between severity levels

5. Alert System

The system implements a three-tier alert mechanism for proactive mental health monitoring:

1. High Score Alerts

Trigger: Average score ≥ 2 over the 3 most recent entries

Rationale: Sustained moderate-to-severe symptoms indicate persistent mental health concerns requiring attention.

Clinical Significance: Scores of 2 (moderate) or 3 (severe) represent clinically significant symptom levels.

2. Worsening Trend Alerts

Trigger: Average increase > 0.5 points over 3 consecutive entries

Rationale: Progressive worsening suggests deteriorating mental health that may require intervention before reaching critical levels.

Clinical Significance: Early detection of declining mental health enables proactive intervention.

3. Sudden Spike Alerts

Trigger: Increase of ≥ 2 points between consecutive entries

Rationale: Rapid deterioration may indicate acute crisis or triggering events.

Clinical Significance: Sudden changes require immediate attention as they may represent crisis situations.
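The three triggers can be expressed as simple rules over one individual's chronological sequence of predicted scores. This is a minimal sketch, not the system's implementation. `check_alerts` and `scan_trajectory` are hypothetical names, and alerts are assumed to be evaluated separately for the depression and anxiety score series. The worsening-trend rule assumes that "average increase over 3 consecutive entries" means the mean of the two step-to-step changes across the three most recent entries. Other readings, such as requiring three successive increases, are possible.

```python
from statistics import mean

# Thresholds from the trigger definitions above.
HIGH_SCORE_WINDOW = 3     # most recent entries averaged for the high-score alert
HIGH_SCORE_THRESHOLD = 2  # average score >= 2 (moderate or worse)
TREND_WINDOW = 3          # consecutive entries examined for a worsening trend
TREND_THRESHOLD = 0.5     # average increase must exceed 0.5 points
SPIKE_THRESHOLD = 2       # jump of >= 2 points between consecutive entries


def check_alerts(scores):
    """Alerts triggered at the latest entry of one chronological 0-3 score series."""
    alerts = []
    if len(scores) >= HIGH_SCORE_WINDOW and mean(scores[-HIGH_SCORE_WINDOW:]) >= HIGH_SCORE_THRESHOLD:
        alerts.append("high_score")
    if len(scores) >= TREND_WINDOW:
        window = scores[-TREND_WINDOW:]
        # Assumption: "average increase" = mean of the step-to-step changes in the window.
        steps = [later - earlier for earlier, later in zip(window, window[1:])]
        if mean(steps) > TREND_THRESHOLD:
            alerts.append("worsening_trend")
    if len(scores) >= 2 and scores[-1] - scores[-2] >= SPIKE_THRESHOLD:
        alerts.append("sudden_spike")
    return alerts


def scan_trajectory(scores):
    """Replay a trajectory entry by entry; map entry index -> alerts fired there."""
    fired = {}
    for t in range(len(scores)):
        alerts = check_alerts(scores[: t + 1])
        if alerts:
            fired[t] = alerts
    return fired


# Hypothetical predicted depression scores for one individual, oldest first.
trajectory = [0, 0, 1, 1, 3, 3, 2, 1]
for t, alerts in scan_trajectory(trajectory).items():
    print(t, alerts)
# 4 ['worsening_trend', 'sudden_spike']
# 5 ['high_score', 'worsening_trend']
# 6 ['high_score']
# 7 ['high_score']
```

At entry 4, the jump from 1 to 3 fires the spike and trend alerts before the three-entry average reaches 2. This illustrates how the trend and spike rules can flag deterioration earlier than the high-score rule.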

6. Key Findings

Outstanding Performance

  • Depression model achieved near-perfect classification (99.87% accuracy)
  • Anxiety model demonstrated strong performance (98.93% accuracy)
  • Both models showed excellent discrimination across all severity levels
  • Minimal confusion between non-adjacent severity classes

Model Comparison

Metric | Depression Model | Anxiety Model
Accuracy | 99.87% | 98.93%
Precision | 1.00 | 0.99
Recall | 1.00 | 0.99
F1-Score | 1.00 | 0.99

The depression model slightly outperformed the anxiety model. This suggests that, at least in this synthetic corpus, textual expressions of depression may be more linguistically distinctive, while expressions of anxiety vary more.

Confusion Matrix Analysis

Depression Model Observations

  • Classes 0, 1, and 2 scored 1.00 on precision, recall, and F1-score (rounded to two decimal places)
  • Class 3 achieved a precision of 0.99 and a recall of 1.00: no severe case was missed, but one entry from a lower severity class was predicted as severe
  • An accuracy of 99.87% on 750 test entries corresponds to 749 correct predictions, so this single misclassification is too little evidence to establish which severity levels the model tends to confuse
  • The model demonstrates extremely high accuracy across all classes

Anxiety Model Observations

  • Classes 0 and 1 were classified perfectly
  • Class 2 achieved precision of 0.97, recall of 0.98, and F1-score of 0.98
  • Class 3 had precision of 0.97 and recall of 0.96, reflecting minor misclassifications
  • The model slightly underperforms in distinguishing moderate and severe anxiety cases compared to depression severity prediction
  • Misclassifications mostly occur between classes 2 and 3
  • Overall performance remains robust with macro-averaged metrics close to 0.99

Clinical Interpretation

From a clinical perspective, borderline cases between adjacent severity levels are inherently subjective, and even human evaluators find it difficult to rate them consistently. The errors that did occur, mostly between moderate and severe anxiety, therefore more plausibly reflect the continuous and gradual nature of symptom severity than a fundamental weakness of the approach.

7. Limitations

Technical Limitations

  • Entry-Level Context Modeling: Each entry is classified independently, so the models cannot draw on previous entries to model symptom progression over time; longitudinal reasoning happens only in the rule-based alert layer
  • Pretrained Embedding Space: The embeddings are not fine-tuned on mental health-specific Arabic text and may miss subtle expressions of distress
  • Language Variability: Dialect variations between Modern Standard Arabic (MSA) and regional dialects not fully addressed

Clinical & Practical Limitations

  • Self-Report Bias: Vulnerable to underreporting, overreporting, and social desirability bias
  • Lack of Clinical Ground Truth: Labels based on self-rated severity rather than clinical diagnoses
  • Cross-Cultural Generalization: Training data from a specific Arabic-speaking community may limit broader applicability
  • Comorbidity Modeling: Separate classifiers don't capture shared symptom patterns between depression and anxiety

8. Ethical Considerations

Responsible AI Development

  • Privacy Protection: Using synthetic data avoids collecting personal information from real individuals, minimizing privacy risks
  • Bias Mitigation: Careful examination of generated text for linguistic imbalances
  • Explainability: Transparent methods, with alerts triggered by explicit, inspectable threshold rules
  • Non-Diagnostic Use: Positioned as supportive screening tool, not diagnostic instrument
  • Fairness & Transparency: Development guided by responsible AI principles

9. Lessons Learned

  1. Transfer Learning Effectiveness: General-purpose multilingual embeddings proved highly effective on this synthetic dataset without task-specific fine-tuning, which could lower barriers to developing robust mental health NLP systems in low-resource domains.
  2. Cultural & Linguistic Context: Strong need for culturally grounded datasets and dialect-aware modeling approaches, as subtle variations significantly influence performance.
  3. Supportive vs. Diagnostic Role: Automated systems should prioritize transparency, ethical deployment, and human-in-the-loop decision making rather than replacing clinical judgment.

10. Conclusion

This study developed a dual-model AI system for early detection and monitoring of depression and anxiety from Arabic text. On the held-out synthetic test set, the models achieved accuracies of 99.87% for depression and 98.93% for anxiety, and the system adds a rule-based three-tier alert mechanism for longitudinal monitoring.

Important Note

Despite its high performance, the system is a supportive tool for mental health specialists, designed to augment rather than replace clinical judgment and professional mental health care.

Future Work

  • Clinical validation against standardized diagnostic assessments
  • Domain-specific fine-tuning of Arabic embeddings for mental health contexts
  • Sequential modeling approaches (e.g., LSTM networks) to incorporate longitudinal user context
  • Expansion to include dialect-specific modeling
  • Multi-task learning for comorbidity modeling

Impact & Significance

This research establishes a foundation for more proactive, accurate, and ethically responsible mental health monitoring aligned with global sustainability goals and supports the development of a more resilient and mentally aware society.