Review Article | Health Policy & Methodology Science: Public Health
Deep learning-based screening for depressive disorders using audio-visual text multimodal information: a narrative review
Abstract
Background and Objective: Prevention of depression, and early screening in particular, remains inadequate. Multimodal deep learning based on non-invasive information (video, audio, and text) shows promise for rapid, low-cost screening. This review surveys existing studies and discusses multimodal deep-learning-based methods for screening depressive disorders from psychiatric and psychological perspectives.
Methods: Web of Science, Scopus, PubMed, IEEE Xplore, Google Scholar, and Embase were searched for records published up to October 2024. English-language studies of depression were eligible if they applied deep-learning-based methods to multimodal data drawn from video, audio, and text. To be included, a study had to aim either to screen for depression or to predict the level of depression.
Key Content and Findings: Of the 1,615 studies retrieved, 26 met the inclusion criteria. In multimodal depression detection, algorithms based on pre-trained models are at the forefront of current research. Among bimodal combinations, audio with video performs best, whereas the performance of trimodal combinations varies widely. Decision-level fusion and feature-level fusion are the most widely applied, but hybrid fusion is promising. The highest accuracy reported for multimodal deep learning in the binary classification of depressive disorder was 93.8%, and the lowest root mean square error (RMSE) in predicting depressive disorder severity was 3.18.
Conclusions: Multimodal deep learning based on video, audio, and text modalities can be used to screen for depression. However, current algorithmic studies rely on a small number of datasets and remain deficient in this respect. Promoting interdisciplinary research is crucial for the future of depression screening.
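
To make the fusion strategies and evaluation metrics named in the findings concrete, the following minimal Python sketch contrasts feature-level fusion (concatenating per-modality feature vectors before a single model) with decision-level fusion (combining per-modality predictions), and computes accuracy and RMSE. It is an illustration only and is not taken from any reviewed study: the function names, array shapes, the 0.5 decision threshold, and all toy values are assumptions. Hybrid fusion, which combines both strategies, is not shown.

```python
import numpy as np

def feature_level_fusion(features):
    """Concatenate per-modality feature matrices (n_samples x d_i) column-wise."""
    return np.concatenate(features, axis=1)

def decision_level_fusion(probabilities, weights=None):
    """Average per-modality depression probabilities, optionally weighted."""
    stacked = np.stack(probabilities, axis=0)  # shape: (n_modalities, n_samples)
    return np.average(stacked, axis=0, weights=weights)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted severity scores."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical toy features for 4 participants (dimensions are arbitrary).
audio_feat = np.zeros((4, 3))
video_feat = np.zeros((4, 5))
text_feat = np.zeros((4, 2))
fused = feature_level_fusion([audio_feat, video_feat, text_feat])
print(fused.shape)  # (4, 10): one joint vector per participant

# Hypothetical per-modality probabilities of depression from separate models.
p_audio = np.array([0.9, 0.2, 0.6, 0.4])
p_video = np.array([0.8, 0.3, 0.4, 0.2])
p_text = np.array([0.7, 0.1, 0.8, 0.6])
p_fused = decision_level_fusion([p_audio, p_video, p_text])
labels_pred = (p_fused >= 0.5).astype(int)  # assumed threshold of 0.5
labels_true = np.array([1, 0, 1, 1])
print(accuracy(labels_true, labels_pred))  # 0.75

# Hypothetical severity scores (e.g. questionnaire totals) for the regression task.
severity_true = [10, 4, 15, 7]
severity_pred = [13, 4, 11, 7]
print(rmse(severity_true, severity_pred))  # 2.5
```

In this sketch, feature-level fusion hands a single joint representation to one downstream model, whereas decision-level fusion keeps one model per modality and merges only their outputs, which makes it easier to tolerate a missing modality at the cost of ignoring cross-modal interactions.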
