Data Generation and Preparation

Synthetic Longitudinal Arabic Mental Health Dataset

To enable trajectory-based experimentation while preserving ethical safeguards, this project employs a fully synthetic longitudinal dataset designed to simulate realistic mental health narratives over time.

The dataset supports severity prediction and structured alert modeling without involving real clinical records.


Dataset Structure

The dataset contains 2,500 Arabic text entries corresponding to 100 synthetic participants, each with 25 time-ordered entries. This structure enables modeling of symptom progression across multiple time points.

Each entry includes:

  • Participant_ID
  • Date (time index for longitudinal tracking)
  • Arabic_Text
  • Depression_Score
  • Anxiety_Score

Severity labels follow a four-level ordinal scale:

Score   Severity
0       None
1       Mild
2       Moderate
3       Severe

The labeling scheme is inspired by established screening instruments such as PHQ-9 and GAD-7. The scores approximate clinically meaningful distinctions while remaining non-diagnostic.
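The field list and severity scale above can be expressed as a small schema. The following is a minimal Python sketch, not the project's actual tooling: the class names, the `check_layout` helper, the weekly date spacing, and the placeholder text are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import IntEnum


class Severity(IntEnum):
    """Four-level ordinal scale shared by both scores."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3


@dataclass(frozen=True)
class Entry:
    participant_id: str   # Participant_ID
    entry_date: date      # Date
    arabic_text: str      # Arabic_Text
    depression: Severity  # Depression_Score
    anxiety: Severity     # Anxiety_Score


def check_layout(entries, n_participants=100, entries_per_participant=25):
    """Return True if every participant has the expected number of
    entries and their dates appear in strictly increasing order."""
    by_participant = {}
    for e in entries:
        by_participant.setdefault(e.participant_id, []).append(e.entry_date)
    if len(by_participant) != n_participants:
        return False
    return all(
        len(dates) == entries_per_participant
        and all(a < b for a, b in zip(dates, dates[1:]))
        for dates in by_participant.values()
    )


# Placeholder rows (hypothetical IDs, dates, and text) to exercise the check.
start = date(2024, 1, 1)
demo = [
    Entry(f"P{p:03d}", start + timedelta(days=7 * t), "نص تجريبي",
          Severity.MILD, Severity.NONE)
    for p in range(100)
    for t in range(25)
]
print(len(demo), check_layout(demo))
```

Using an `IntEnum` keeps the labels ordinal, so comparisons such as `Severity.MILD < Severity.SEVERE` behave as expected.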


Data Generation Methodology

Text entries were generated using AI-assisted text synthesis to simulate self-reported mental health narratives in Arabic.

Generation was guided by the following design principles:

  • Representation of gradual symptom progression
  • Inclusion of fluctuations and episodic worsening
  • Consistency within participant-level trajectories
  • Variation in linguistic tone and intensity across severity levels
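One way to picture the first three principles is as a label trajectory that drifts slowly and occasionally spikes. The sketch below is purely illustrative and is not the procedure used to build the dataset: the function name, the drift and episode probabilities, and the upward bias are hypothetical choices.

```python
import random


def simulate_trajectory(n_steps=25, start=0, p_drift=0.15, p_episode=0.05, seed=None):
    """Simulate one participant's ordinal severity (0 to 3) over n_steps entries."""
    rng = random.Random(seed)
    baseline = start
    scores = []
    for _ in range(n_steps):
        # Gradual progression: the baseline moves at most one level per step,
        # with a mild upward bias (two of the three choices are +1).
        if rng.random() < p_drift:
            baseline = min(3, max(0, baseline + rng.choice([-1, 1, 1])))
        # Episodic worsening: a one-step spike above the current baseline.
        spike = 1 if rng.random() < p_episode else 0
        scores.append(min(3, baseline + spike))
    return scores


# Fixing the seed makes a participant's trajectory reproducible.
print(simulate_trajectory(seed=7))
```

Keeping a persistent baseline per participant is what gives the trajectory its internal consistency; the spike term adds short-lived fluctuations on top of it.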

Because the data are synthetic, the distribution of severity classes can be controlled directly, without the ethical exposure of collecting real narratives.

No real individuals, social media content, or clinical records were used.


Ethical Rationale for Synthetic Design

Using synthetic data provides:

  • Privacy preservation by design
  • Elimination of identifiable health information
  • Avoidance of consent-related risks
  • Controlled experimentation without harm

This dataset is intended exclusively for research and educational purposes.


Data Splitting Strategy

To evaluate model generalization fairly, the dataset was partitioned using stratified random sampling:

  • Training set: 1,750 entries (70 percent)
  • Testing set: 750 entries (30 percent)

Stratification preserves the proportion of each severity level in both subsets, so the test set reflects the same class distribution as the training set.
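A 70/30 stratified split can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the class counts are invented (the real distribution is not given here), stratification uses a single label column, and `stratified_split` is a hypothetical helper rather than the project's code.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.30, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical class counts, chosen only so the totals match 2,500 entries.
labels = [0] * 1000 + [1] * 700 + [2] * 500 + [3] * 300
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

In practice, scikit-learn's `train_test_split` with its `stratify` argument performs the same kind of split in a single call.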


Text Preparation and Feature Representation

Minimal preprocessing was applied to preserve semantic richness:

  • Removal of extraneous formatting artifacts
  • Retention of original sentence-level structure
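A light cleaning pass of this kind might look like the sketch below. It is an assumption about what "formatting artifacts" covers (leftover markup tags, invisible control or format characters, and irregular whitespace); the `clean_entry` name and the sample sentence are hypothetical. Letters, diacritics, and sentence punctuation are left untouched.

```python
import re
import unicodedata

_TAG = re.compile(r"<[^>]+>")        # leftover markup tags
_SPACES = re.compile(r"[ \u00a0]+")  # runs of plain or non-breaking spaces


def clean_entry(text):
    """Strip formatting debris while keeping wording and sentence punctuation."""
    text = _TAG.sub(" ", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\t", " ")
    # Drop control and format characters (Unicode categories Cc and Cf,
    # e.g. zero-width spaces), but keep newlines.
    text = "".join(
        ch for ch in text
        if ch == "\n" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    text = _SPACES.sub(" ", text)
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


sample = "<p>أشعر بالتعب اليوم.\u200b</p>\n\n  لا أستطيع النوم!  "
print(clean_entry(sample))
```

No stemming, stop-word removal, or diacritic stripping is applied, since the downstream embedding model is expected to work on natural sentences.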

Instead of hand-engineered features, each text entry was converted into a dense 768-dimensional vector using a multilingual embedding model.

This embedding-based approach captures contextual semantics and reduces reliance on handcrafted linguistic features.

Further details of the embedding architecture and classification framework are provided in the Models section.


Dataset Limitations

Despite supporting controlled longitudinal experimentation, the dataset has important constraints:

  • Severity labels are simulated rather than clinically validated
  • Dialectal and colloquial diversity within Arabic is not explicitly modeled
  • Implicit, metaphorical, or culturally nuanced expressions may not be fully represented
  • Synthetic distributions may exhibit clearer class boundaries than real-world data

These constraints limit direct generalization and motivate future real-world validation.