Data Generation and Preparation
Synthetic Longitudinal Arabic Mental Health Dataset
To enable trajectory-based experimentation while preserving ethical safeguards, this project employs a fully synthetic longitudinal dataset designed to simulate realistic mental health narratives over time.
The dataset supports severity prediction and structured alert modeling without involving real clinical records.
Dataset Structure
The dataset contains 2,500 Arabic text entries corresponding to 100 synthetic participants, each with 25 time-ordered entries. This structure enables modeling of symptom progression across multiple time points.
Each entry includes:
- Participant_ID
- Date (time index for longitudinal tracking)
- Arabic_Text
- Depression_Score
- Anxiety_Score
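The fields listed above can be sketched as a small record type, which also makes the 100 × 25 = 2,500 arithmetic explicit. This is an illustrative skeleton only: the snake_case field names, the `P001`-style identifiers, the start date, and the weekly spacing are assumptions not stated in the source.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Entry:
    participant_id: str
    date: date              # time index for longitudinal tracking
    arabic_text: str
    depression_score: int   # ordinal, 0-3
    anxiety_score: int      # ordinal, 0-3


def build_skeleton(n_participants=100, entries_per_participant=25,
                   start=date(2024, 1, 1), step_days=7):
    """Create empty, time-ordered entries for every participant.

    The start date and 7-day spacing are hypothetical placeholders.
    """
    rows = []
    for p in range(1, n_participants + 1):
        for t in range(entries_per_participant):
            rows.append(Entry(
                participant_id=f"P{p:03d}",  # hypothetical ID format
                date=start + timedelta(days=step_days * t),
                arabic_text="",               # filled in by text generation
                depression_score=0,
                anxiety_score=0,
            ))
    return rows


rows = build_skeleton()
```

Grouping `rows` by `participant_id` yields one 25-step sequence per participant, which is the unit a trajectory model would consume.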
Severity labels follow a four-level ordinal scale:
| Score | Severity |
|---|---|
| 0 | None |
| 1 | Mild |
| 2 | Moderate |
| 3 | Severe |
The labeling scheme is inspired by established screening instruments such as PHQ-9 and GAD-7. The scores approximate clinically meaningful distinctions while remaining non-diagnostic.
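One way to encode the four-level scale in code is an integer enumeration, which keeps the ordinal ordering (None < Mild < Moderate < Severe) available for comparisons. The class and helper names below are illustrative assumptions, not identifiers from the project.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Four-level ordinal severity scale (non-diagnostic)."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3


def severity_label(score: int) -> str:
    """Map a numeric score to its label; raises ValueError outside 0-3."""
    return Severity(score).name.capitalize()
```

Because `IntEnum` members compare as integers, a check such as `Severity(score) >= Severity.MODERATE` can serve as a simple threshold in alert logic.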
Data Generation Methodology
Text entries were generated using AI-assisted text synthesis to simulate self-reported mental health narratives in Arabic.
Generation was guided by the following design principles:
- Representation of gradual symptom progression
- Inclusion of fluctuations and episodic worsening
- Consistency within participant-level trajectories
- Variation in linguistic tone and intensity across severity levels
The synthetic approach allows the class distribution to be controlled directly and avoids the ethical risks of handling real health data.
No real individuals, social media content, or clinical records were used.
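To make the design principles concrete, the sketch below simulates a single participant's score trajectory as a bounded random walk with a slow upward drift, small fluctuations, and occasional transient spikes. It is a hypothetical illustration and not the procedure used to build the dataset, whose entries were produced by AI-assisted text synthesis; the function name and every parameter value (`drift`, `noise`, `spike_prob`) are assumptions.

```python
import random


def simulate_trajectory(n_steps=25, start=0.0, drift=0.08, noise=0.3,
                        spike_prob=0.08, seed=None):
    """Return n_steps ordinal scores in 0..3 for one synthetic participant."""
    rng = random.Random(seed)
    level = float(start)
    scores = []
    for _ in range(n_steps):
        # Gradual progression (drift) plus small fluctuation (Gaussian noise).
        level += drift + rng.gauss(0.0, noise)
        level = min(3.0, max(0.0, level))  # keep the latent level on the scale
        observed = level
        if rng.random() < spike_prob:
            # Episodic worsening: one-step bump that does not persist.
            observed = min(3.0, level + 1.0)
        scores.append(int(round(observed)))
    return scores


trajectory = simulate_trajectory(seed=7)
```

Carrying the latent `level` from one step to the next is what keeps a trajectory internally consistent: consecutive scores stay close unless a spike occurs.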
Ethical Rationale for Synthetic Design
Using synthetic data provides:
- Privacy preservation by design
- Elimination of identifiable health information
- Avoidance of consent-related risks
- Controlled experimentation without harm
This dataset is intended exclusively for research and educational purposes.
Data Splitting Strategy
To evaluate model generalization fairly, the dataset was partitioned using stratified random sampling:
- Training set: 1,750 entries (70 percent)
- Testing set: 750 entries (30 percent)
Stratification preserves the proportion of each severity level in both subsets, so every class appears in evaluation at its overall frequency.
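A stratified 70/30 split can be sketched with the standard library alone, as below. The source does not state which tool or which label column was used for stratification, so the function, the seed, and the toy class counts are assumptions; in practice scikit-learn's `train_test_split` with its `stratify` argument is a common choice.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                       # random sampling within class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 2,500 entries with an assumed, uneven class distribution.
labels = [0] * 1000 + [1] * 700 + [2] * 500 + [3] * 300
train_idx, test_idx = stratified_split(labels)
```

With these toy counts the split yields 1,750 training and 750 testing indices, and each severity level makes up the same share of the test set as of the full label list.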
Text Preparation and Feature Representation
Minimal preprocessing was applied to preserve semantic richness:
- Removal of extraneous formatting artifacts
- Retention of original sentence-level structure
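The two preprocessing steps above might look like the following minimal sketch. The source does not specify which formatting artifacts were removed, so the choices here (markup tags, zero-width spaces, byte-order marks, repeated spaces) and the function name `clean_entry` are assumptions; Arabic letters, diacritics, and punctuation are left untouched.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")              # leftover HTML/XML-style tags
_INLINE_WS_RE = re.compile(r"[ \t\u00a0]+")   # runs of spaces, tabs, NBSP


def clean_entry(text: str) -> str:
    """Strip formatting debris while keeping sentences and line breaks."""
    text = _TAG_RE.sub(" ", text)
    # Remove zero-width space and byte-order mark characters.
    text = text.replace("\u200b", "").replace("\ufeff", "")
    lines = [_INLINE_WS_RE.sub(" ", line).strip() for line in text.splitlines()]
    # Drop empty lines but keep the original line-level structure.
    return "\n".join(line for line in lines if line)


cleaned = clean_entry("<p>أشعر بالتعب.</p>\n\n  لا أستطيع النوم.  ")
```

No stemming, stop-word removal, or normalization is applied, in keeping with the goal of preserving semantic richness for the embedding model.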
Instead of manual feature engineering, each text entry was converted into a dense numerical vector using a multilingual embedding model with 768-dimensional output.
This embedding-based approach captures contextual semantics and reduces reliance on handcrafted linguistic features.
Further details of the embedding architecture and classification framework are provided in the Models section.
Dataset Limitations
Despite supporting controlled longitudinal experimentation, the dataset has important constraints:
- Severity labels are simulated rather than clinically validated
- Dialectal and colloquial diversity within Arabic is not explicitly modeled
- Implicit, metaphorical, or culturally nuanced expressions may not be fully represented
- Synthetic distributions may exhibit clearer class boundaries than real-world data
These constraints limit direct generalization to real clinical or user-generated text and motivate future validation on real-world data.