Abstract
Tracking mental health through automated text analysis offers a scalable and accessible approach to screening, especially for Arabic, which remains underrepresented in this area. This study proposes a dual-model machine learning system for assessing depression and anxiety severity from Arabic text.
1. Research Overview
Key Innovation
This research addresses a critical gap in mental health monitoring by developing an AI-based longitudinal tracking system that evaluates behavioral and linguistic indicators over time, identifies early indicators of decline, and promptly notifies users when alarming patterns appear.
The Problem
Most existing mental health research relies on static, cross-sectional data collected at a single point in time, which restricts the ability to capture gradual or sudden changes in anxiety and depression levels. This often results in delayed diagnosis until individuals reach critical stages.
The Solution
An AI-based longitudinal monitoring system that moves mental health analysis from static classification to trajectory-based evaluation, using AI-generated synthetic longitudinal data designed to replicate realistic psychological patterns.
3. Methodology
Dataset
The dataset comprises synthetically generated Arabic text entries created through AI-assisted generation (Claude). Each entry is independently annotated with depression and anxiety severity scores on a four-point ordinal scale:
- 0: None
- 1: Mild
- 2: Moderate
- 3: Severe
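As a minimal sketch of how one annotated entry could be represented, the snippet below pairs a text with its two severity labels. The class and field names (`Severity`, `Entry`, `text`, `depression`, `anxiety`) are hypothetical, and the sample sentence and its labels are purely illustrative rather than taken from the dataset.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Four-point ordinal scale described above."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3


@dataclass(frozen=True)
class Entry:
    """One text entry with independent depression and anxiety labels (hypothetical schema)."""
    text: str
    depression: Severity
    anxiety: Severity


# Illustrative example only: "I could not sleep well this week."
entry = Entry(
    text="لم أستطع النوم جيدا هذا الأسبوع",
    depression=Severity.MILD,
    anxiety=Severity.MODERATE,
)
```

Using an `IntEnum` keeps the ordinal nature of the scale, so comparisons such as `entry.anxiety > entry.depression` behave as expected.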
Technical Architecture
Feature Engineering
EmbeddingGemma-300M Model (Google)
- 768-dimensional output embedding
- Multilingual support for 100+ languages including Arabic
- Transforms natural language into numerical vector representations
- Captures semantic and contextual information
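The feature extraction step listed above might look roughly like the following sketch. It assumes the model is loaded through the `sentence-transformers` library under the Hugging Face ID `google/embeddinggemma-300m`; both the loading route and the helper names (`load_embeddinggemma`, `embed_entries`) are assumptions, since the source does not say how the embeddings were computed.

```python
from typing import List, Sequence


def load_embeddinggemma():
    """Load the encoder. Assumed model ID; weights are downloaded on first use."""
    # Imported lazily so the rest of this sketch works without the dependency.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("google/embeddinggemma-300m")


def embed_entries(texts: Sequence[str], encoder) -> List[List[float]]:
    """Turn text entries into fixed-length vectors.

    `encoder` is any object exposing an `encode(list_of_str, ...)` method,
    such as a SentenceTransformer instance.
    """
    vectors = encoder.encode(list(texts), normalize_embeddings=True)
    # Convert to plain Python floats so downstream code is library-agnostic.
    return [[float(x) for x in row] for row in vectors]
```

A call such as `embed_entries(texts, load_embeddinggemma())` would then be expected to return one 768-dimensional vector per entry, matching the output size listed above.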
Classification Models
The system implements a dual-model architecture:
- Depression Severity Prediction Model
- Anxiety Severity Prediction Model
Both models use Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel to capture non-linear class boundaries arising from the morphological richness and sparsity of Arabic text features.
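The dual-model setup can be sketched with scikit-learn as two independent RBF-kernel SVMs trained on the same feature matrix. This is an illustration under stated assumptions, not the study's training code: the hyperparameters (`C=1.0`, `gamma="scale"`) are library defaults rather than reported values, and the random vectors and labels below merely stand in for real 768-dimensional embeddings and annotations.

```python
import numpy as np
from sklearn.svm import SVC


def train_severity_model(X, y, C=1.0, gamma="scale"):
    """Fit one RBF-kernel SVM that predicts a 0-3 severity label."""
    model = SVC(kernel="rbf", C=C, gamma=gamma)
    model.fit(X, y)
    return model


# Placeholder data: random vectors and labels, NOT the study's dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))
y_depression = rng.integers(0, 4, size=200)
y_anxiety = rng.integers(0, 4, size=200)

# Two separate classifiers share the same features but have their own labels.
depression_model = train_severity_model(X, y_depression)
anxiety_model = train_severity_model(X, y_anxiety)

depression_pred = depression_model.predict(X[:5])
anxiety_pred = anxiety_model.predict(X[:5])
```

Keeping the two classifiers separate means each one can be tuned and evaluated on its own, at the cost of not sharing information between the two conditions.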
4. Results & Performance
Depression Score Prediction Model
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 (None) | 1.00 | 1.00 | 1.00 | 256 |
| 1 (Mild) | 1.00 | 1.00 | 1.00 | 203 |
| 2 (Moderate) | 1.00 | 1.00 | 1.00 | 166 |
| 3 (Severe) | 0.99 | 1.00 | 1.00 | 125 |
Anxiety Score Prediction Model
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 (None) | 1.00 | 1.00 | 1.00 | 260 |
| 1 (Mild) | 1.00 | 1.00 | 1.00 | 204 |
| 2 (Moderate) | 0.97 | 0.98 | 0.98 | 170 |
| 3 (Severe) | 0.97 | 0.96 | 0.97 | 116 |
ROC Curves & AUC Scores
Depression Model AUC
| Class | AUC Score |
|---|---|
| 0 (None) | 1.00 |
| 1 (Mild) | 1.00 |
| 2 (Moderate) | 1.00 |
| 3 (Severe) | 0.99 |
| Macro-Average | 1.00 |
| Weighted-Average | 1.00 |
Anxiety Model AUC
| Class | AUC Score |
|---|---|
| 0 (None) | 1.00 |
| 1 (Mild) | 1.00 |
| 2 (Moderate) | 0.98 |
| 3 (Severe) | 0.97 |
| Macro-Average | 0.99 |
| Weighted-Average | 0.99 |
AUC Interpretation
- Depression Model: Achieves near-perfect discrimination (AUC ≥ 0.99) for all severity levels
- Anxiety Model: Maintains highly effective discrimination (AUC ≥ 0.97) across all classes
- The minor reduction in the higher-severity anxiety classes suggests slightly more overlap in feature space
- Both models demonstrate excellent ability to distinguish between severity levels
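Per-class AUC values with macro and weighted averages, like those tabulated above, are commonly obtained with a one-vs-rest scheme. The sketch below assumes that scheme (the source does not state which one was used) and computes AUC directly as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The function names are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC for one class: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(score_matrix, y_true, n_classes=4):
    """Return (per-class AUCs, macro average, support-weighted average).

    score_matrix[i][k] is the model's score for sample i belonging to class k.
    """
    per_class = [
        binary_auc([row[k] for row in score_matrix], [y == k for y in y_true])
        for k in range(n_classes)
    ]
    counts = [sum(1 for y in y_true if y == k) for k in range(n_classes)]
    macro = sum(per_class) / n_classes
    weighted = sum(a * c for a, c in zip(per_class, counts)) / len(y_true)
    return per_class, macro, weighted
```

The macro average treats all four severity levels equally, while the weighted average gives more influence to classes with larger support, which is why the two can differ when classes are imbalanced.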
5. Alert System
The system implements a three-tier alert mechanism for proactive mental health monitoring:
1. High Score Alerts
Rationale: Sustained moderate-to-severe symptoms indicate persistent mental health concerns requiring attention.
Clinical Significance: Scores of 2 (moderate) or 3 (severe) represent clinically significant symptom levels.
2. Worsening Trend Alerts
Rationale: Progressive worsening suggests deteriorating mental health that may require intervention before reaching critical levels.
Clinical Significance: Early detection of declining mental health enables proactive intervention.
3. Sudden Spike Alerts
Rationale: Rapid deterioration may indicate acute crisis or triggering events.
Clinical Significance: Sudden changes require immediate attention as they may represent crisis situations.
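The three tiers above can be sketched as simple rules over a user's chronological sequence of predicted scores. All thresholds and window sizes here (`threshold=2`, `min_run=3`, `window=3`, `min_jump=2`) are hypothetical choices made for illustration, as are the function names; the source does not specify the exact trigger conditions.

```python
from typing import List, Sequence


def high_score_alert(scores: Sequence[int], threshold: int = 2, min_run: int = 3) -> bool:
    """Sustained symptoms: the last `min_run` scores are all moderate or severe."""
    recent = list(scores[-min_run:])
    return len(recent) == min_run and all(s >= threshold for s in recent)


def worsening_trend_alert(scores: Sequence[int], window: int = 3) -> bool:
    """Progressive worsening: no decrease within the window and a net increase."""
    recent = list(scores[-window:])
    if len(recent) < window:
        return False
    steps = [b - a for a, b in zip(recent, recent[1:])]
    return all(d >= 0 for d in steps) and recent[-1] > recent[0]


def sudden_spike_alert(scores: Sequence[int], min_jump: int = 2) -> bool:
    """Rapid deterioration: the latest score jumped by at least `min_jump` levels."""
    return len(scores) >= 2 and scores[-1] - scores[-2] >= min_jump


def check_alerts(scores: Sequence[int]) -> List[str]:
    """Run all three tiers on one severity series (depression or anxiety)."""
    alerts = []
    if high_score_alert(scores):
        alerts.append("high_score")
    if worsening_trend_alert(scores):
        alerts.append("worsening_trend")
    if sudden_spike_alert(scores):
        alerts.append("sudden_spike")
    return alerts


print(check_alerts([0, 0, 2]))  # ['worsening_trend', 'sudden_spike']
print(check_alerts([2, 3, 3]))  # ['high_score', 'worsening_trend']
```

In a dual-model setting, `check_alerts` would be applied separately to the depression and anxiety series, and the tiers are not mutually exclusive, so a single entry can raise more than one alert.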
6. Key Findings
Outstanding Performance
- Depression model achieved near-perfect classification (99.87% accuracy)
- Anxiety model demonstrated strong performance (98.93% accuracy)
- Both models showed excellent discrimination across all severity levels
- Minimal confusion between non-adjacent severity classes
Model Comparison
| Metric | Depression Model | Anxiety Model |
|---|---|---|
| Accuracy | 99.87% | 98.93% |
| Precision | 1.00 | 0.99 |
| Recall | 1.00 | 0.99 |
| F1-Score | 1.00 | 0.99 |
The depression model slightly outperformed the anxiety model, suggesting that textual expressions of depression may be more linguistically distinctive, while expressions of anxiety show more linguistic variation.
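As a quick arithmetic check, the reported accuracies can be related to the per-class support counts from the classification reports. The sketch below only uses numbers already given in this document; the function name `implied_errors` is hypothetical.

```python
# Per-class support counts taken from the two classification reports.
support = {
    "depression": [256, 203, 166, 125],
    "anxiety": [260, 204, 170, 116],
}
reported_accuracy = {"depression": 0.9987, "anxiety": 0.9893}


def implied_errors(supports, accuracy):
    """Return (total samples, number of misclassified samples implied by accuracy)."""
    total = sum(supports)
    correct = round(accuracy * total)
    return total, total - correct


for name in support:
    total, errors = implied_errors(support[name], reported_accuracy[name])
    print(f"{name}: {total} test samples, about {errors} misclassified")
# depression: 750 test samples, about 1 misclassified
# anxiety: 750 test samples, about 8 misclassified
```

Both evaluation sets contain 750 samples, so the gap between the two models corresponds to roughly seven additional misclassified entries for anxiety.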
Confusion Matrix Analysis
Depression Model Observations
- Classes 0, 1, and 2 were classified with perfect precision, recall, and F1-score
- Class 3 achieved a precision of 0.99 and a recall of 1.00, indicating minimal false-positive predictions
- Most misclassifications occur between adjacent severity levels (e.g., mild vs. moderate)
- Severe cases are rarely confused with none/mild classes
- The model demonstrates extremely high accuracy across all classes
Anxiety Model Observations
- Classes 0 and 1 were classified perfectly
- Class 2 achieved precision of 0.97, recall of 0.98, and F1-score of 0.98
- Class 3 had precision of 0.97 and recall of 0.96, reflecting minor misclassifications
- The model slightly underperforms in distinguishing moderate and severe anxiety cases compared to depression severity prediction
- Misclassifications mostly occur between classes 2 and 3
- Overall performance remains robust with macro-averaged metrics close to 0.99
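To make the observations above concrete, the sketch below derives per-class precision, recall, and F1 from a confusion matrix and counts how many errors fall between adjacent severity levels. The matrix is hypothetical: it is one possible matrix that is consistent with the rounded anxiety figures reported earlier, assuming every error lies between classes 2 and 3. It is not the study's actual confusion matrix.

```python
# Rows are true classes, columns are predicted classes (0-3).
# HYPOTHETICAL matrix, chosen to match the rounded anxiety report.
confusion = [
    [260, 0, 0, 0],
    [0, 204, 0, 0],
    [0, 0, 167, 3],
    [0, 0, 5, 111],
]


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = 2 * tp / (predicted_k + actual_k) if predicted_k + actual_k else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def error_breakdown(cm):
    """Count errors between adjacent levels versus non-adjacent levels."""
    adjacent = non_adjacent = 0
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i == j:
                continue
            if abs(i - j) == 1:
                adjacent += count
            else:
                non_adjacent += count
    return adjacent, non_adjacent


for k, (p, r, f) in enumerate(per_class_metrics(confusion)):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(error_breakdown(confusion))  # (8, 0)
```

Separating adjacent from non-adjacent errors is useful on an ordinal scale, because confusing moderate with severe is a much milder mistake than confusing none with severe.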
Clinical Interpretation
From a clinical perspective, borderline cases between adjacent severity levels are inherently subjective, and even human evaluators often struggle to rate them consistently. Thus, the errors that did occur are not indicative of a fundamental weakness of the approach but reflect the continuous and gradual nature of symptom severity across mental health conditions.
7. Limitations
Technical Limitations
- Single-Sentence Context Modeling: Limited ability to model disease progression over time based on previous entries
- Predefined Embedding Space: Not fine-tuned on mental health-specific Arabic text, potentially missing subtle expressions of distress
- Language Variability: Dialect variations between Modern Standard Arabic (MSA) and regional dialects not fully addressed
Clinical & Practical Limitations
- Self-Report Bias: Vulnerable to underreporting, overreporting, and social desirability bias
- Lack of Clinical Ground Truth: Labels based on self-rated severity rather than clinical diagnoses
- Cross-Cultural Generalization: Training data from specific Arabic-speaking community may limit broader applicability
- Comorbidity Modeling: Separate classifiers don't capture shared symptom patterns between depression and anxiety
8. Ethical Considerations
Responsible AI Development
- Privacy Protection: Synthetic data generation eliminates risks to individual privacy
- Bias Mitigation: Careful examination of generated text for linguistic imbalances
- Explainability: Transparent methods and interpretable features
- Non-Diagnostic Use: Positioned as supportive screening tool, not diagnostic instrument
- Fairness & Transparency: Development guided by responsible AI principles
9. Lessons Learned
- Transfer Learning Effectiveness: General-purpose multilingual embeddings proved highly effective on Arabic text without task-specific fine-tuning, lowering barriers to developing robust mental health NLP systems in low-resource domains.
- Cultural & Linguistic Context: Strong need for culturally grounded datasets and dialect-aware modeling approaches, as subtle variations significantly influence performance.
- Supportive vs. Diagnostic Role: Automated systems should prioritize transparency, ethical deployment, and human-in-the-loop decision making rather than replacing clinical judgment.
10. Conclusion
This study developed a dual-model system for early detection and monitoring of depression and anxiety using Arabic text analysis. The system achieved accuracy of 99.87% for depression and 98.93% for anxiety on the synthetic dataset, and implements a three-tier alert mechanism for longitudinal monitoring.
Important Note
Despite its high performance, this model is a supportive tool for mental health specialists: it is designed to augment professional mental health care, not to replace clinical judgment.
Future Work
- Clinical validation against standardized diagnostic assessments
- Domain-specific fine-tuning of Arabic embeddings for mental health contexts
- Sequential modeling approaches (e.g., LSTM networks) to incorporate longitudinal user context
- Expansion to include dialect-specific modeling
- Multi-task learning for comorbidity modeling
Impact & Significance
This research establishes a foundation for more proactive, accurate, and ethically responsible mental health monitoring aligned with global sustainability goals and supports the development of a more resilient and mentally aware society.