AI-Based Mental Health Tracking Research Report

Abstract

Automated text analysis offers a scalable and accessible way to screen for mental health concerns, particularly for Arabic, which remains underrepresented in this line of research. This study proposes a dual-model machine learning system for assessing depression and anxiety severity from Arabic text.

Keywords: Mental Health, Depression, Anxiety, Arabic NLP, Machine Learning, Text Classification, Automated Screening

2. Related Work

AI in Healthcare

Artificial Intelligence has been widely adopted in various domains, particularly in physical health. Its ability to rapidly analyze large-scale data and recognize complex patterns has enabled effective disease detection and monitoring. Despite these advancements, the adoption of AI in mental health has progressed more slowly due to ethical concerns, data sensitivity, and the inherent complexity of psychological conditions.

AI Applications in Mental Health

AI enables new approaches to mental illness analysis by integrating patient information from diverse sources such as electronic medical records, self-reported questionnaires, and social media content. Within the mental health domain, AI applications are commonly categorized into three main areas:

Digital Phenotyping: Collecting behavioral data through sensors, mobile devices, speech signals, and interaction patterns to infer changes in mood and mental state
Natural Language Processing: Analyzing text or speech to identify emotional and cognitive indicators
Conversational Agents: Chatbots applying human-computer interaction principles to support dialogue-based mental health applications

Machine Learning Approaches

Tree-based models such as classification and regression trees have demonstrated strong effectiveness in mental health-related tasks due to their ability to handle both continuous and categorical features, perform automatic feature selection, and provide interpretable results. Previous research has shown that AI-based models can recognize mental health risk patterns across diverse data sources.

Notable Research Examples

Systems predicting suicide attempts up to 24 hours before occurrence by analyzing large-scale datasets, with performance exceeding expert clinical judgment in certain scenarios
AI-based systems continuously monitoring dialogue to identify depression-related language patterns and dynamically assess risk levels
Deep learning models (CNNs and LSTMs) successfully classifying depressive content in social media text

Evolution of Research Approaches

Earlier studies primarily relied on structured, cross-sectional datasets composed of predefined demographic, socioeconomic, and behavioral variables. More recent research has adopted longitudinal designs using repeated measurements to better understand mental health progression over time. Growing attention has been directed toward text-based data, as linguistic patterns have been shown to reflect emotional well-being and symptoms of mental disorders.

This Study's Contribution

Building on this body of research, the current project focuses on leveraging text-based narrative analysis for mental health risk detection while emphasizing ethical use and interpretability. Unlike many previous studies that prioritize model accuracy or direct diagnostic classification, this project explicitly positions artificial intelligence as a supportive tool rather than a diagnostic system. The goal is not to replace clinical assessment, but to assist in identifying potential risk signals from narrative text that may support early awareness and intervention.

1. Research Overview

Key Innovation

This research addresses a critical gap in mental health monitoring by developing an AI-based longitudinal tracking system that evaluates behavioral and linguistic indicators over time, identifies early signs of decline, and promptly notifies users when alarming patterns appear.

The Problem

Most existing mental health research relies on static, cross-sectional data collected at a single point in time, which restricts the ability to capture gradual or sudden changes in anxiety and depression levels. This often results in delayed diagnosis until individuals reach critical stages.

The Solution

An AI-based longitudinal monitoring system that moves mental health analysis from static classification to trajectory-based evaluation, using AI-generated synthetic longitudinal data designed to replicate realistic psychological patterns.

3. Methodology

Dataset

Arabic Text Entries: 2,500
Individuals Tracked: 100
Entries per Individual: 25
Train/Test Split: 70/30

The dataset comprises synthetically generated Arabic text entries created through AI-assisted generation (Claude). Each entry is independently annotated with depression and anxiety severity scores on a four-point ordinal scale:

0: None
1: Mild
2: Moderate
3: Severe
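
The dataset figures above (100 individuals, 25 entries each, 70/30 split) can be illustrated with a short sketch. The source does not say whether the split was made per entry or per individual, so the grouping by individual shown here is an assumption, as are the function name `split_by_individual` and the integer owner IDs. Either way, a 30% test portion of 2,500 entries contains 750 entries.

```python
import random

def split_by_individual(entry_owner_ids, test_fraction=0.30, seed=0):
    """Split entry indices so all entries of one individual land on the same side.

    Assumption: grouping by individual is illustrative; the report only
    states a 70/30 train/test split.
    """
    individuals = sorted(set(entry_owner_ids))
    rng = random.Random(seed)
    rng.shuffle(individuals)
    n_test = round(len(individuals) * test_fraction)
    test_people = set(individuals[:n_test])
    train_idx = [i for i, owner in enumerate(entry_owner_ids) if owner not in test_people]
    test_idx = [i for i, owner in enumerate(entry_owner_ids) if owner in test_people]
    return train_idx, test_idx

# 100 hypothetical individuals x 25 entries each = 2,500 entries
owners = [person for person in range(100) for _ in range(25)]
train_idx, test_idx = split_by_individual(owners)
print(len(train_idx), len(test_idx))  # 1750 750
```

Keeping each individual's entries on one side of the split is one way to prevent a model from being tested on text from a person it has already seen during training.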

Technical Architecture

Feature Engineering

EmbeddingGemma-300M Model (Google)

768-dimensional output embedding
Multilingual support for 100+ languages including Arabic
Transforms natural language into numerical vector representations
Captures semantic and contextual information

Classification Models

The system implements a dual-model architecture:

Depression Severity Prediction Model
Anxiety Severity Prediction Model

Both models use Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel to capture non-linear class boundaries arising from the morphological richness and sparsity of Arabic text features.
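
As a rough illustration of what the RBF kernel computes, the sketch below evaluates the standard formula K(x, y) = exp(-gamma * ||x - y||^2) on two small vectors. The function name, the toy three-dimensional vectors, and the `gamma` value are assumptions chosen for readability; the report does not state the hyperparameters used, and the real inputs would be 768-dimensional embeddings.

```python
import math

def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Toy vectors standing in for text embeddings (hypothetical values)
entry_a = [0.2, 0.1, 0.4]
entry_b = [0.2, 0.1, 0.4]   # identical to entry_a
entry_c = [0.9, 0.7, 0.0]   # further away from entry_a

print(rbf_kernel(entry_a, entry_b, gamma=1.0))  # 1.0 for identical vectors
print(rbf_kernel(entry_a, entry_c, gamma=1.0))  # smaller value for distant vectors
```

Because similarity decays smoothly with distance rather than linearly, an SVM using this kernel can separate classes whose boundaries are curved in the embedding space.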

4. Results & Performance

Depression Score Prediction Model

Accuracy: 99.87%
Precision: 1.00
Recall: 1.00
F1-Score: 1.00

Class	Precision	Recall	F1-Score	Support
0 (None)	1.00	1.00	1.00	256
1 (Mild)	1.00	1.00	1.00	203
2 (Moderate)	1.00	1.00	1.00	166
3 (Severe)	0.99	1.00	1.00	125

Anxiety Score Prediction Model

Accuracy: 98.93%
Precision: 0.99
Recall: 0.99
F1-Score: 0.99

Class	Precision	Recall	F1-Score	Support
0 (None)	1.00	1.00	1.00	260
1 (Mild)	1.00	1.00	1.00	204
2 (Moderate)	0.97	0.98	0.98	170
3 (Severe)	0.97	0.96	0.97	116
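
The reported accuracies can be cross-checked against the support counts in the two tables. The sketch below is plain arithmetic on the published figures; the helper name `implied_errors` is an assumption introduced for illustration.

```python
def implied_errors(accuracy_percent, supports):
    """Return (test set size, number of misclassified entries implied by the accuracy)."""
    n = sum(supports)
    return n, round(n * (1 - accuracy_percent / 100))

depression = implied_errors(99.87, [256, 203, 166, 125])
anxiety = implied_errors(98.93, [260, 204, 170, 116])
print(depression)  # (750, 1)
print(anxiety)     # (750, 8)
```

Both test sets contain 750 entries, which matches 30% of the 2,500-entry dataset, and the accuracies correspond to 1 and 8 misclassified entries respectively.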

ROC Curves & AUC Scores

Depression Model AUC

Class	AUC Score
0 (None)	1.00
1 (Mild)	1.00
2 (Moderate)	1.00
3 (Severe)	0.99
Macro-Average	1.00
Weighted-Average	1.00

Anxiety Model AUC

Class	AUC Score
0 (None)	1.00
1 (Mild)	1.00
2 (Moderate)	0.98
3 (Severe)	0.97
Macro-Average	0.99
Weighted-Average	0.99
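
For readers unfamiliar with the two averages, the sketch below recomputes them from the per-class AUC values. It assumes the weighted average uses the class supports from the classification tables, and because the inputs are already rounded to two decimals, the results can only be expected to agree with the reported averages to within rounding.

```python
def macro_and_weighted(aucs, supports):
    """Macro average treats classes equally; weighted average scales by class support."""
    macro = sum(aucs) / len(aucs)
    weighted = sum(a * s for a, s in zip(aucs, supports)) / sum(supports)
    return macro, weighted

dep_macro, dep_weighted = macro_and_weighted([1.00, 1.00, 1.00, 0.99], [256, 203, 166, 125])
anx_macro, anx_weighted = macro_and_weighted([1.00, 1.00, 0.98, 0.97], [260, 204, 170, 116])

print(f"Depression: macro={dep_macro:.4f}, weighted={dep_weighted:.4f}")
print(f"Anxiety:    macro={anx_macro:.4f}, weighted={anx_weighted:.4f}")
```

The weighted average sits slightly above the macro average for both models because the largest classes (0 and 1) are also the ones with the highest AUC.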

AUC Interpretation

Depression Model: Achieves near-perfect discrimination (AUC ≥ 0.99) for all severity levels
Anxiety Model: Maintains highly effective discrimination (AUC ≥ 0.97) across all classes
Minor reduction in the higher-severity anxiety classes suggests slightly more overlap in feature space
Both models demonstrate excellent ability to distinguish between severity levels

5. Alert System

The system implements a three-tier alert mechanism for proactive mental health monitoring:

1. High Score Alerts

Trigger: Average score ≥ 2 over most recent 3 entries

Rationale: Sustained moderate-to-severe symptoms indicate persistent mental health concerns requiring attention.

Clinical Significance: Scores of 2 (moderate) or 3 (severe) represent clinically significant symptom levels.

2. Worsening Trend Alerts

Trigger: Average increase > 0.5 points over 3 consecutive entries

Rationale: Progressive worsening suggests deteriorating mental health that may require intervention before reaching critical levels.

Clinical Significance: Early detection of declining mental health enables proactive intervention.

3. Sudden Spike Alerts

Trigger: Increase of ≥ 2 points between consecutive entries

Rationale: Rapid deterioration may indicate acute crisis or triggering events.

Clinical Significance: Sudden changes require immediate attention as they may represent crisis situations.
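
The three triggers described above can be sketched as one function over a chronological list of predicted severity scores (0 to 3). This is a minimal illustration rather than the system's actual implementation: the function name `check_alerts`, the alert labels, and the reading of the trend rule as the mean step-to-step change across the most recent three entries are all assumptions.

```python
from statistics import mean

def check_alerts(scores):
    """Return the alert labels triggered by a chronological list of 0-3 severity scores."""
    alerts = []
    recent = scores[-3:]
    if len(recent) == 3:
        # High score: average of the most recent 3 entries is at least 2
        if mean(recent) >= 2:
            alerts.append("high_score")
        # Worsening trend (assumed reading): mean change between consecutive
        # entries in the most recent 3 entries exceeds 0.5 points
        steps = [later - earlier for earlier, later in zip(recent, recent[1:])]
        if mean(steps) > 0.5:
            alerts.append("worsening_trend")
    # Sudden spike: latest entry is at least 2 points above the previous one
    if len(scores) >= 2 and scores[-1] - scores[-2] >= 2:
        alerts.append("sudden_spike")
    return alerts

print(check_alerts([1, 1, 1]))  # []
print(check_alerts([2, 2, 3]))  # ['high_score']
print(check_alerts([0, 1, 3]))  # ['worsening_trend', 'sudden_spike']
```

The same function would be applied separately to the depression and anxiety score histories, since the two models produce independent predictions.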

6. Key Findings

Outstanding Performance

Depression model achieved near-perfect classification (99.87% accuracy)
Anxiety model demonstrated strong performance (98.93% accuracy)
Both models showed excellent discrimination across all severity levels
Minimal confusion between non-adjacent severity classes

Model Comparison

Metric	Depression Model	Anxiety Model
Accuracy	99.87%	98.93%
Precision	1.00	0.99
Recall	1.00	0.99
F1-Score	1.00	0.99

The depression model slightly outperformed the anxiety model, suggesting that textual expressions of depression may be easier to distinguish linguistically, while expressions of anxiety show more linguistic variation.

Confusion Matrix Analysis

Depression Model Observations

Classes 0, 1, and 2 were classified with perfect precision, recall, and F1-score
Class 3 achieved a precision of 0.99 and a recall of 1.00, indicating minimal false-positive predictions
Most misclassifications occur between adjacent severity levels (e.g., mild vs. moderate)
Severe cases are rarely confused with none/mild classes
The model demonstrates extremely high accuracy across all classes

Anxiety Model Observations

Classes 0 and 1 were classified perfectly
Class 2 achieved precision of 0.97, recall of 0.98, and F1-score of 0.98
Class 3 had precision of 0.97 and recall of 0.96, reflecting minor misclassifications
The model slightly underperforms in distinguishing moderate and severe anxiety cases compared to depression severity prediction
Misclassifications mostly occur between classes 2 and 3
Overall performance remains robust with macro-averaged metrics close to 0.99
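
The quantities discussed in this analysis (per-class precision and recall, and the share of errors falling between adjacent severity levels) can all be derived from a confusion matrix. The sketch below shows the calculation on a small made-up matrix; the counts are hypothetical and are not the study's confusion matrices, and both function names are assumptions.

```python
def per_class_metrics(cm):
    """cm[i][j] = entries with true class i predicted as class j.
    Returns a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))
        actual_k = sum(cm[k])
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        results.append((precision, recall, f1))
    return results

def adjacent_error_share(cm):
    """Fraction of misclassifications that are off by exactly one severity level."""
    errors = adjacent = 0
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i != j:
                errors += count
                if abs(i - j) == 1:
                    adjacent += count
    return adjacent / errors if errors else None

# Hypothetical 4-class matrix (rows: true class 0-3, columns: predicted class 0-3)
toy_cm = [
    [50, 0, 0, 0],
    [0, 40, 0, 0],
    [0, 0, 28, 2],
    [0, 0, 1, 19],
]
metrics = per_class_metrics(toy_cm)
print(adjacent_error_share(toy_cm))  # 1.0, every error is between adjacent classes
```

In this toy matrix all three errors fall between classes 2 and 3, which is the pattern the observations above describe for the anxiety model.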

Clinical Interpretation

From a clinical perspective, borderline cases between adjacent severity levels are inherently subjective, and even human evaluators often struggle to rate them consistently. Thus, the errors that did occur do not indicate a fundamental weakness of the approach; they reflect the continuous and gradual nature of symptom severity across mental health conditions.

7. Limitations

Technical Limitations

Single-Sentence Context Modeling: Limited ability to model disease progression over time based on previous entries
Predefined Embedding Space: Not fine-tuned on mental health-specific Arabic text, potentially missing subtle expressions of distress
Language Variability: Dialect variations between Modern Standard Arabic (MSA) and regional dialects not fully addressed

Clinical & Practical Limitations

Self-Report Bias: Vulnerable to underreporting, overreporting, and social desirability bias
Lack of Clinical Ground Truth: Labels based on self-rated severity rather than clinical diagnoses
Cross-Cultural Generalization: Training data from specific Arabic-speaking community may limit broader applicability
Comorbidity Modeling: Separate classifiers don't capture shared symptom patterns between depression and anxiety

8. Ethical Considerations

Responsible AI Development

Privacy Protection: Synthetic data generation eliminates risks to individual privacy
Bias Mitigation: Careful examination of generated text for linguistic imbalances
Explainability: Transparent methods and interpretable features
Non-Diagnostic Use: Positioned as supportive screening tool, not diagnostic instrument
Fairness & Transparency: Development guided by responsible AI principles

9. Lessons Learned

Transfer Learning Effectiveness: General-purpose multilingual embeddings proved highly effective on Arabic text without task-specific fine-tuning, lowering barriers to developing robust mental health NLP systems in low-resource domains.
Cultural & Linguistic Context: Strong need for culturally grounded datasets and dialect-aware modeling approaches, as subtle variations significantly influence performance.
Supportive vs. Diagnostic Role: Automated systems should prioritize transparency, ethical deployment, and human-in-the-loop decision making rather than replacing clinical judgment.

10. Conclusion

This study developed a dual-model AI system for early detection and monitoring of depression and anxiety using Arabic text analysis. On the held-out portion of the synthetic dataset, the system achieved very high accuracy (99.87% for depression, 98.93% for anxiety), and it implements a three-tier alert mechanism for longitudinal monitoring.

Important Note

Despite its high performance, this model serves as a supportive tool for mental health specialists and does not replace clinical judgment. The system is designed to augment, not replace, professional mental health care.

Future Work

Clinical validation against standardized diagnostic assessments
Domain-specific fine-tuning of Arabic embeddings for mental health contexts
Sequential modeling approaches (e.g., LSTM networks) to incorporate longitudinal user context
Expansion to include dialect-specific modeling
Multi-task learning for comorbidity modeling

Impact & Significance

This research establishes a foundation for more proactive, accurate, and ethically responsible mental health monitoring aligned with global sustainability goals and supports the development of a more resilient and mentally aware society.

AI-Based Longitudinal Tracking for Early Prediction of Mental Health Decline with Predictive Alerts
