Preprint 2026

Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Mehrab Beikzadeh¹, Yasaman Asadollah salmanpour², Ashima Suvarna¹, Sriram Sankararaman¹, Matteo Malgaroli³, Majid Sarrafzadeh¹, Saadia Gabriel¹
¹UCLA Computer Science; ²UT Austin Psychology; ³NYU Grossman School of Medicine

Abstract

Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.

We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization. We train reward models for six criteria—empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy—and systematically compare multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging.
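The survey produces rankings, but reward models and DPO-style training consume pairwise preferences. The sketch below assumes each ranking is expanded into every (chosen, rejected) pair and that each criterion's reward model is fit with the standard Bradley-Terry loss. The helpers `ranking_to_pairs` and `bradley_terry_loss` are hypothetical names, and the paper's exact pair construction may differ.

```python
import itertools
import math


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) pairs.

    Assumption: every higher-ranked response is preferred over every
    lower-ranked one, so a ranking of k items yields k*(k-1)/2 pairs.
    """
    return list(itertools.combinations(ranked_responses, 2))


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response wins under a
    Bradley-Terry model: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    # -log sigmoid(m) == softplus(-m)
    return softplus(-margin)


pairs = ranking_to_pairs(["resp_A", "resp_B", "resp_C"])
print(pairs)  # prints [('resp_A', 'resp_B'), ('resp_A', 'resp_C'), ('resp_B', 'resp_C')]
```

A reward model that cannot distinguish the two responses (equal rewards) incurs a loss of log 2 per pair; the loss shrinks as the chosen response's reward pulls ahead.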

Multi-objective DPO (MODPO) achieves a better balance (77.6% empathy and 62.6% safety win rates) than single-objective optimization (93.6% empathy, 47.8% safety), and aligning on therapeutic criteria outperforms aligning on general communication principles (Grice's Maxims) by 17.2%. Blinded clinician evaluation confirms that MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.

Key Findings

vs. Base Model
99.3%
MODPO Survey preference rate over the base model on overall therapeutic preference
vs. General Criteria
65.5%
MODPO Survey preference rate over MODPO Maxim (Grice's Maxims) on overall preference
Safety–Empathy Balance
77.6% / 62.6%
MODPO empathy and safety win rates — balancing both objectives simultaneously
LLM–Human Agreement
70.3%
LLM-evaluator agreement with clinicians on overall preference, exceeding the 65.5% human (inter-clinician) baseline
Clinician Validated
85%
Clinician preference for MODPO on empathy; MODPO was also preferred on every other therapeutic dimension
Patient Participants
335
Individuals with lived mental health experience surveyed for therapeutic preference rankings
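The LLM–human agreement figure compares an automatic evaluator's choices with clinicians' choices. A minimal sketch, assuming agreement is simple percent agreement on which response was preferred and that the human baseline is the mean agreement over clinician pairs; the paper may use a different statistic (for example, a chance-corrected one). All names and labels here are hypothetical.

```python
import itertools


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two raters picked the same response."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def mean_pairwise_agreement(raters):
    """Average percent agreement over every pair of raters (e.g. clinicians)."""
    pairs = list(itertools.combinations(raters, 2))
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)


def llm_vs_raters(llm_labels, raters):
    """Average agreement between the LLM evaluator and each human rater."""
    return sum(percent_agreement(llm_labels, r) for r in raters) / len(raters)


# Hypothetical per-scenario choices ("MODPO" or "base")
llm = ["MODPO", "MODPO", "base", "MODPO"]
clinician = ["MODPO", "base", "base", "MODPO"]
print(percent_agreement(llm, clinician))  # prints 0.75
```

Comparing `llm_vs_raters` against `mean_pairwise_agreement` asks whether the LLM agrees with clinicians about as often as clinicians agree with one another.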

The Safety–Empathy Tradeoff

Single-objective optimization forces a tradeoff: maximizing empathy (93.6%) comes at the cost of safety (47.8%). Multi-objective approaches such as MODPO reach the upper-right region of the safety-empathy plane, scoring high on both dimensions simultaneously.
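In multi-objective terms, a tradeoff means that neither model Pareto-dominates the other. The sketch below checks this using only the two (empathy, safety) win-rate pairs reported in the abstract; `dominates` and `pareto_front` are illustrative helpers, not the paper's code.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (Pareto dominance)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the names of models not dominated by any other model."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


# (empathy win rate, safety win rate) as reported in the abstract
reported = {
    "single-objective (empathy)": (93.6, 47.8),
    "MODPO": (77.6, 62.6),
}
print(pareto_front(reported))  # prints ['single-objective (empathy)', 'MODPO']
```

Neither reported point dominates the other, so both sit on the front; MODPO's advantage over the empathy-only model is a more balanced position rather than strict dominance.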

Figure 2. Safety-empathy performance across training approaches. Each point shows a model's average win rate computed across all pairwise head-to-head comparisons.
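The averaging described in the caption can be sketched as follows. Assumptions: each head-to-head result is stored as the fraction of comparisons won by the first model, the reverse matchup is one minus that value (ties ignored), and the numbers are purely illustrative, not results from the paper. `average_win_rates` is a hypothetical helper.

```python
from collections import defaultdict


def average_win_rates(head_to_head):
    """Average each model's win rate over all of its pairwise matchups.

    head_to_head maps (model_a, model_b) -> fraction of comparisons won by
    model_a; model_b's win rate is assumed to be 1 - that value (no ties).
    """
    rates = defaultdict(list)
    for (a, b), a_wins in head_to_head.items():
        rates[a].append(a_wins)
        rates[b].append(1.0 - a_wins)
    return {model: sum(r) / len(r) for model, r in rates.items()}


# Hypothetical safety win rates for three models (illustrative numbers only)
safety = {
    ("base", "SFT"): 0.40,
    ("base", "MODPO"): 0.30,
    ("SFT", "MODPO"): 0.45,
}
print(average_win_rates(safety))
```

Running the same computation on empathy judgments gives each model's other coordinate in a safety-empathy scatter plot.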

Clinician Validation

Six licensed mental health clinicians conducted blinded evaluations comparing MODPO Survey against the base model across 100 therapeutic scenarios. MODPO was preferred on every dimension.

Figure 5a. Clinician win rates by criterion. MODPO Survey is consistently preferred over the base model across all seven evaluation dimensions.

Method

Our approach combines patient-centered preference collection with multi-objective direct preference optimization (MODPO). We survey individuals with lived mental health experience, train separate reward models for each therapeutic criterion, and use MODPO's margin-based framework to simultaneously optimize across all objectives while treating safety as a non-negotiable constraint.
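The margin-based idea can be illustrated with the per-example MODPO objective as originally proposed (Zhou et al., 2023): a DPO-style logit for one objective k, scaled by β/w_k, minus the weighted reward gap that frozen reward models assign for the other objectives. This is a sketch under assumptions: β, the weights, which criterion plays the role of k, and how the paper enforces safety as a non-negotiable constraint are not specified here (a weighted margin alone is not a hard constraint), and all names are hypothetical.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -softplus(-x).
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def modpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
               other_rewards_w, other_rewards_l, weights, k, beta=0.1):
    """Per-example MODPO loss for the objective indexed k.

    weights: objective weights (one per objective, indexed by position).
    other_rewards_w / other_rewards_l: dicts {j: r_j(x, y)} from frozen
    reward models for each objective j != k, scored on the chosen (w)
    and rejected (l) responses.
    """
    w_k = weights[k]
    # DPO-style implicit reward gap relative to the reference policy
    logit = (beta / w_k) * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Margin: weighted reward gap from the other objectives' reward models
    margin = sum(weights[j] * (other_rewards_w[j] - other_rewards_l[j])
                 for j in other_rewards_w)
    logit -= margin / w_k
    return -log_sigmoid(logit)


# Two objectives, training on objective 0; policy identical to reference
print(modpo_loss(-1.0, -2.0, -1.0, -2.0, {1: 0.0}, {1: 0.0}, [0.5, 0.5], k=0))
```

With the policy equal to the reference and no reward gap on the other objective, the loss is log 2. When another objective's reward model already favors the chosen response, the margin raises the bar the policy's own implicit reward gap must clear.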

Figure 6. Complete methodological pipeline showing the flow from dataset preparation through final evaluation across both experimental phases.

Citation

@article{beikzadeh2026modpo,
  title={Multi-Objective Alignment of Language Models
         for Personalized Psychotherapy},
  author={Beikzadeh, Mehrab and Asadollah salmanpour,
          Yasaman and Suvarna, Ashima and Sankararaman,
          Sriram and Malgaroli, Matteo and Sarrafzadeh,
          Majid and Gabriel, Saadia},
  journal={arXiv preprint arXiv:2602.16053},
  year={2026}
}