Behavioral Classification Pipeline for Clinical Interview Transcripts

Automated multi-label behavioral classification of clinical interview transcripts using generative AI, compared against a classical embedding + XGBoost approach and evaluated on a dataset annotated with 26 behavioral labels.

Overview

Behavioral scientists manually annotated clinical interview transcripts using 26 behavioral labels across the Risk, Effort, and Social Influence categories, a costly and time-intensive workflow. The objective was to automate multi-label behavioral classification of transcript passages while handling overlapping behavioral definitions. Transcript data was stored in and processed through AWS S3.

Approach 1: LLM-Based Prompt Engineering

The goal was to build a domain-adapted LLM classification pipeline that predicts behavioral labels for an interview transcript and explains its reasoning. Multiple generative AI solutions, spanning local and cloud LLMs, were evaluated for multi-label behavioral prediction using structured prompt engineering and alignment with the behavioral definitions.

Zero-shot and few-shot prompt design with curated behavioral examples
Structured multi-label outputs with ranked confidence and rationale generation
Controlled decoding parameters for deterministic classification behavior
Comparative evaluation across local (Ollama-hosted) and API-based models
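
The prompt and output handling described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the label names, their definitions, the JSON schema, and the decoding settings are all assumptions made for the example, and no specific model client is called.

```python
import json

# Hypothetical labels and definitions for illustration only;
# the real 26-label schema is not reproduced here.
LABELS = {
    "risk_aversion": "Avoids options with uncertain outcomes.",
    "effort_avoidance": "Prefers the option that requires less work.",
    "peer_influence": "Decision is shaped by what peers do or say.",
}

# Assumed decoding settings; parameter names vary by model client.
DECODING = {"temperature": 0.0, "top_p": 1.0}


def build_prompt(passage, examples=()):
    """Assemble a zero-shot (no examples) or few-shot classification prompt."""
    lines = ["Classify the passage using ONLY these labels:"]
    for name, definition in LABELS.items():
        lines.append(f"- {name}: {definition}")
    lines.append(
        'Return JSON: {"labels": [{"label": str, "confidence": float, "rationale": str}]}'
    )
    for ex_passage, ex_labels in examples:
        lines.append(f"Passage: {ex_passage}")
        lines.append("Answer: " + json.dumps({"labels": ex_labels}))
    lines.append(f"Passage: {passage}")
    lines.append("Answer:")
    return "\n".join(lines)


def _confidence(item):
    """Read the confidence field, falling back to 0.0 if it is missing or malformed."""
    try:
        return float(item.get("confidence", 0.0))
    except (TypeError, ValueError):
        return 0.0


def parse_response(raw):
    """Parse model output, drop labels outside the schema, rank by confidence."""
    try:
        items = json.loads(raw).get("labels", [])
    except (json.JSONDecodeError, AttributeError):
        return []  # unparseable output is treated as "no labels"
    if not isinstance(items, list):
        return []
    valid = [
        item for item in items
        if isinstance(item, dict) and item.get("label") in LABELS
    ]
    return sorted(valid, key=_confidence, reverse=True)
```

Filtering against the known label set is one simple guard against hallucinated labels: anything the model invents is discarded before evaluation rather than counted as a prediction.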

Models Evaluated

Qwen2.5, Llama 3.1, Mistral, OpenAI GPT, Gemini

Approach 2: Text Embeddings + XGBoost Classifier

A classical NLP pipeline using semantic embeddings and XGBoost-based multi-label classification.

Convert each manually annotated transcript passage into a dense semantic vector capturing behavioral context
XGBoost multi-label classifier for per-label probability prediction
Per-class threshold tuning to address behavioral label imbalance
Precision and recall analysis across common and underrepresented classes
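
Per-class threshold tuning can be illustrated with a small, dependency-free sketch. It assumes the classifier has already produced one probability per label for each validation passage, and it picks thresholds by maximizing F1 on a fixed grid. Both the F1 criterion and the grid are assumptions for this example, not necessarily what the project used.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for one label when predicting positive at prob >= threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = prob >= threshold
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif truth:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(y_true, y_prob, grid=None):
    """Pick, per label, the grid threshold with the best validation F1.

    y_true: list of rows of 0/1 flags, one column per label.
    y_prob: list of rows of predicted probabilities, same shape.
    Ties are resolved in favor of the lowest threshold on the grid.
    """
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    n_labels = len(y_true[0])
    thresholds = []
    for j in range(n_labels):
        truth_col = [row[j] for row in y_true]
        prob_col = [row[j] for row in y_prob]
        best = max(grid, key=lambda t: f1_at_threshold(truth_col, prob_col, t))
        thresholds.append(best)
    return thresholds
```

A rare label whose positives score only around 0.3 would be missed entirely at a global 0.5 cutoff; tuning per label lets its threshold drop while common labels keep a stricter one.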

Key Learnings

Not every problem needs an LLM: the classical embedding + XGBoost pipeline performed on par with the LLM-based approaches; sometimes an out-of-the-box classical ML solution is the right fit for the dataset
LLMs carry their own prior definitions of behavioral concepts: the definitions set by behavioral scientists during annotation differed from how these concepts are represented in an LLM's pre-training data, leading to definition conflicts that prompting alone couldn't resolve
LLMs struggle with nuanced distinctions between similar behaviors: overlapping labels cause confusion; strict prompt constraints and few-shot anchoring were essential to reduce label hallucination
Data imbalance handling is critical: underrepresented behavioral labels require per-class threshold calibration, and every label needs adequate representation in each data split for evaluation to be reliable
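
One way to check label representation across splits is a simple coverage report. The sketch below is an assumption-laden illustration: the split names, the data layout (one list of label names per passage), and the `min_count` cutoff are all hypothetical.

```python
from collections import Counter


def label_counts(rows):
    """Count how many passages carry each label; rows are lists of label names."""
    counts = Counter()
    for labels in rows:
        counts.update(set(labels))  # count each label once per passage
    return counts


def missing_labels(splits, all_labels, min_count=1):
    """Return {split_name: sorted labels with fewer than min_count positives}."""
    report = {}
    for name, rows in splits.items():
        counts = label_counts(rows)
        short = sorted(label for label in all_labels if counts[label] < min_count)
        if short:
            report[name] = short
    return report
```

An empty report means every label clears the cutoff in every split; any label listed for a validation or test split cannot be scored meaningfully there, so the split should be redrawn.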

Tech Stack

LLM Pipeline

Ollama, OpenAI GPT, Gemini, Qwen2.5, Llama 3.1, Mistral

Classical ML

Sentence Transformers, XGBoost, scikit-learn

Infrastructure

AWS S3, AWS Lambda

Evaluation

pandas, matplotlib, Per-class Threshold Tuning

Want to Work Together?

Need intelligent classification pipelines for domain-specific or behavioral data? Whether the answer is prompt engineering, fine-tuning, or classical ML, I can design the right solution for your use case.
