Evaluation Methodology
Quantitative and Class-Level Assessment Framework
This section describes the evaluation protocol used to assess the predictive performance and robustness of the dual-model framework for depression and anxiety severity classification.
The evaluation emphasizes both aggregate performance metrics and detailed class-level behavior.
Evaluation Objectives
The evaluation is designed to:
- Quantify classification performance across all severity levels
- Assess robustness for higher-severity cases
- Examine ordinal boundary confusion
- Compare performance between depression and anxiety models
- Ensure reproducibility and transparent reporting
Train–Test Partitioning
The dataset was divided using stratified random sampling to preserve severity-level proportions:
- Training set: 1,750 entries (70 percent)
- Testing set: 750 entries (30 percent)
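Stratification keeps the none, mild, moderate, and severe categories proportionally represented in both subsets, so the class distribution of the test set mirrors that of the full dataset. It does not rebalance the classes: any imbalance present in the data is preserved in both subsets rather than distorted by the random split.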
All reported metrics are computed exclusively on the held-out test set.
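The partitioning step can be illustrated with a minimal sketch using only the Python standard library. The function name `stratified_split`, the seed value, and the per-class label counts are assumptions made for illustration; the actual severity distribution and splitting tool are not stated here. In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would typically serve the same purpose.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)  # fixed seed (assumed) keeps the partition repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical label counts for illustration only (2,500 entries in total).
labels = ["none"] * 1000 + ["mild"] * 750 + ["moderate"] * 500 + ["severe"] * 250
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))
print(train_counts, test_counts)
```

With these assumed counts, each class contributes 30 percent of its entries to the test set, which reproduces the 1,750 / 750 partition sizes.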
Evaluation Metrics
Performance was assessed using standard multi-class classification metrics:
- Accuracy: Overall proportion of correct predictions
- Precision (per class): True positives divided by predicted positives
- Recall (per class): True positives divided by actual positives
- F1-score (per class): Harmonic mean of precision and recall
To provide balanced evaluation:
- Macro-average metrics treat all classes equally
- Weighted-average metrics account for class support
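Both reporting styles are included because they answer different questions: macro averages expose weak performance on small classes that weighted averages can mask, while weighted averages reflect performance on the test set as it is actually distributed.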
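The metric definitions above can be sketched in plain Python as follows. The function names and the ten-entry toy label lists are hypothetical and are not drawn from the study's data; a library report such as scikit-learn's `classification_report` would normally be used instead.

```python
CLASSES = ["none", "mild", "moderate", "severe"]


def per_class_metrics(y_true, y_pred, classes=CLASSES):
    """Return {class: (precision, recall, f1, support)}."""
    metrics = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)  # predicted positives
        actual = sum(1 for t in y_true if t == c)     # actual positives (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1, actual)
    return metrics


def macro_and_weighted_f1(metrics):
    """Macro F1 weights classes equally; weighted F1 weights by support."""
    total = sum(m[3] for m in metrics.values())
    macro = sum(m[2] for m in metrics.values()) / len(metrics)
    weighted = sum(m[2] * m[3] for m in metrics.values()) / total
    return macro, weighted


# Toy labels for illustration only.
y_true = ["none"] * 4 + ["mild"] * 3 + ["moderate"] * 2 + ["severe"]
y_pred = ["none", "none", "none", "mild",
          "mild", "mild", "moderate",
          "moderate", "severe",
          "severe"]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1, weighted_f1 = macro_and_weighted_f1(metrics)
print(accuracy, macro_f1, weighted_f1)
```

In this toy example the weighted F1 exceeds the macro F1 because the largest class (none) is also the best-classified one, which is the kind of gap that reporting both averages is meant to reveal.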
Class-Level Performance Analysis
Given the ordinal nature of severity levels, analysis focused on:
- Confusion between adjacent categories such as moderate and severe
- Stability in identifying higher-risk classes
- Rare extreme misclassifications such as severe predicted as none
Because mental health severity exists on a continuum, minor overlap between neighboring classes is expected and does not necessarily indicate instability.
Confusion Matrix Examination
Confusion matrices were analyzed to visualize prediction distributions.
Key interpretation criteria:
- Strong diagonal dominance indicates reliable class separation
- Off-diagonal concentration near adjacent classes reflects ordinal boundary overlap
- Minimal severe-to-none misclassification supports high-risk detection reliability
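The interpretation criteria above can be made concrete with a small sketch that builds a confusion matrix (rows are true classes, columns are predicted classes) and separates adjacent-class errors from more distant ones. The helper names and the toy labels are assumptions for illustration, and the class order none < mild < moderate < severe is taken from the severity levels named earlier.

```python
CLASSES = ["none", "mild", "moderate", "severe"]  # ordered by severity


def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """matrix[i][j] counts entries of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def ordinal_error_summary(matrix):
    """Split predictions into correct, adjacent-class, and distant errors."""
    n = len(matrix)
    cells = [(i, j) for i in range(n) for j in range(n)]
    return {
        "correct": sum(matrix[i][i] for i in range(n)),
        "adjacent": sum(matrix[i][j] for i, j in cells if abs(i - j) == 1),
        "distant": sum(matrix[i][j] for i, j in cells if abs(i - j) > 1),
        "severe_as_none": matrix[n - 1][0],
    }


# Toy labels for illustration only.
y_true = ["none", "none", "mild", "mild", "moderate", "moderate",
          "severe", "severe", "severe"]
y_pred = ["none", "mild", "mild", "mild", "moderate", "severe",
          "severe", "moderate", "none"]

matrix = confusion_matrix(y_true, y_pred)
summary = ordinal_error_summary(matrix)
print(matrix)
print(summary)
```

A high `adjacent` count relative to `distant` corresponds to ordinal boundary overlap, while `severe_as_none` isolates the extreme misclassification that matters most for high-risk detection.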
ROC and AUC Evaluation
Receiver Operating Characteristic curves were computed using a one-vs-rest strategy for each severity class.
The Area Under the Curve (AUC) measures discriminative ability independent of a fixed decision threshold.
AUC analysis provides:
- Class-wise separability insight
- Comparative difficulty between depression and anxiety models
- Stability across varying classification thresholds
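A one-vs-rest AUC can be sketched without any library by using its rank interpretation: the probability that a randomly chosen positive entry receives a higher score than a randomly chosen negative one, with ties counted as one half. The per-class scores below are invented for illustration; how the SVM scores are actually obtained (for example decision-function values or calibrated probabilities) is not specified here and is left as an assumption.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative entry")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, class_scores, classes):
    """Treat each class in turn as positive and all other classes as negative."""
    return {
        c: binary_auc([row[k] for row in class_scores],
                      [t == c for t in y_true])
        for k, c in enumerate(classes)
    }


CLASSES = ["none", "mild", "moderate", "severe"]
# Hypothetical per-class scores, one row per test entry (illustration only).
y_true = ["none", "mild", "moderate", "severe", "severe", "none"]
class_scores = [
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.40, 0.20, 0.10],
    [0.10, 0.30, 0.40, 0.20],
    [0.05, 0.10, 0.25, 0.60],
    [0.10, 0.20, 0.40, 0.30],
    [0.50, 0.30, 0.10, 0.10],
]

auc = one_vs_rest_auc(y_true, class_scores, CLASSES)
print(auc)
```

The pairwise loop is quadratic in the number of entries, which is acceptable for a sketch; library implementations such as scikit-learn's `roc_auc_score` compute the same quantity more efficiently.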
Dual-Model Comparison
Separate SVM classifiers were trained for:
- Depression severity
- Anxiety severity
Comparative analysis evaluates:
- Relative performance across equivalent severity classes
- Sensitivity to higher-severity cases
- Differences in boundary ambiguity
Observed performance differences are discussed in the Results section.
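One simple way to organize the comparison across equivalent severity classes is a per-class recall gap between the two models. The sketch below uses placeholder recall values that are NOT results from this study; the function name `recall_gap` is likewise hypothetical.

```python
CLASSES = ["none", "mild", "moderate", "severe"]


def recall_gap(recall_a, recall_b, classes=CLASSES):
    """Per-class recall difference (model A minus model B)."""
    return {c: round(recall_a[c] - recall_b[c], 4) for c in classes}


# Placeholder values for illustration only, not reported results.
depression_recall = {"none": 0.90, "mild": 0.80, "moderate": 0.75, "severe": 0.85}
anxiety_recall = {"none": 0.88, "mild": 0.82, "moderate": 0.70, "severe": 0.82}

gaps = recall_gap(depression_recall, anxiety_recall)
# Class where the two models differ most in sensitivity.
largest_gap_class = max(CLASSES, key=lambda c: abs(gaps[c]))
print(gaps, largest_gap_class)
```

Positive values indicate higher sensitivity in the depression model for that class, negative values higher sensitivity in the anxiety model.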
Reproducibility Considerations
- Stratified sampling ensures consistent severity distribution
- Hyperparameters are tuned to optimize generalization
- Models are serialized for consistent reuse
- Evaluation is performed on a fixed held-out test set
These measures support replicability and transparent review.
Evaluation Limitations
The evaluation protocol is subject to constraints:
- Labels are simulated rather than clinically verified
- Predictions are generated at the entry level without sequence modeling
- Synthetic data may produce clearer separability than real-world narratives
As a result, reported performance reflects controlled experimental conditions rather than clinical validation.
Further discussion appears in the Limitations and Future Work sections.