Deep learning-based screening for depressive disorders using audio-visual text multimodal information: a narrative review

Zhong Ding; Yan-Min Xu; Chen-Ling Liu; An-Jie Dai; Zhen-Tao Liu; Bao-Liang Zhong

doi:10.21037/amj-25-6

Review Article | Health Policy & Methodology Science: Public Health

Zhong Ding¹,²,³#, Yan-Min Xu²#, Chen-Ling Liu³, An-Jie Dai⁴, Zhen-Tao Liu⁴, Bao-Liang Zhong²,³

¹Department of Psychological and Behavioral Sciences, Zhejiang University, Hangzhou, China; ²Department of Psychiatry, Wuhan Mental Health Center, Wuhan, China; ³Research Center for Psychological and Health Sciences, China University of Geosciences (Wuhan), Wuhan, China; ⁴School of Automation, China University of Geosciences (Wuhan), Wuhan, China

Contributions: (I) Conception and design: All authors; (II) Administrative support: CL Liu, BL Zhong; (III) Provision of study materials or patients: Z Ding, ZT Liu; (IV) Collection and assembly of data: All authors; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

^#The authors contributed equally to this work.

Correspondence to: Bao-Liang Zhong, MD, PhD. Department of Psychiatry, Wuhan Mental Health Center, 89 Gongnongbing Road, Jiang’an District, Wuhan 430012, China; Research Center for Psychological and Health Sciences, China University of Geosciences (Wuhan), Wuhan, China. Email: haizhilan@gmail.com.

Background and Objective: Depression prevention, especially early screening, is deficient. Multimodal deep learning based on non-invasive information (video, audio and text) shows promise for rapid screening at low cost. This study reviewed existing studies and discussed multimodal deep learning based methods in screening for depressive disorders from psychiatric and psychological perspectives.

Methods: Web of Science, Scopus, PubMed, IEEE Xplore, Google Scholar, and Embase were searched up to October 2024. English-language studies were included if they applied deep learning methods to multimodal data drawn from video, audio, and text, and if their aim was to screen for depression or to predict the level of depression.

Key Content and Findings: Of the 1,615 studies retrieved, 26 met the inclusion criteria. In multimodal depression detection, algorithms based on pre-trained models are at the forefront of current research, audio and video form the best-performing bimodal combination, and results for trimodal combinations vary widely. Decision-level fusion and feature-level fusion are the most widely applied, but hybrid fusion is promising. Multimodal deep learning achieved a highest accuracy of 93.8% for the binary classification of depressive disorder, while its lowest root mean square error (RMSE) in predicting depressive disorder severity was 3.18.

Conclusions: Multimodal deep learning based on video, audio, and text modalities can be used to screen for depression, but current algorithmic studies rely on a small number of datasets and are deficient in this respect. Promoting interdisciplinary research is crucial for depression screening in the future.

Keywords: Depressive disorder; deep learning; multimodal; screening; non-intrusive

Received: 26 January 2025; Accepted: 06 August 2025; Published online: 12 November 2025.

Introduction

Background

The report from the World Health Organization underscores that depressive disorder constitutes a principal contributor to global disease burden, with major depressive disorder (MDD) ranking as a leading cause of mortality among adolescents and young adults worldwide (1). According to the Global Health Data Survey, an estimated 280 million individuals (3.8% of the global population) are afflicted by depressive disorder, encompassing 5% of the adult population (4% of males and 6% of females), and 5.7% of those aged over 60 years (2). Annually, over 700,000 individuals succumb to suicide, marking it as the fourth leading cause of death in the 15–29 years age group (3). Despite the availability of efficacious preventative and therapeutic interventions for depressive disorders (4), over 75% of individuals in low-income and developing nations do not receive timely and effective treatment during the initial stages of these conditions (5). The exacerbation of depressive disorders complicates treatment approaches and prognoses, thereby highlighting the critical role of early depressive disorder screening (6). However, a pronounced deficit in psychiatric and psychological professionals impedes extensive screening and preventative measures (4). Historically, early screening for depressive disorders has predominantly relied on self-assessment tools such as the Beck Depression Inventory (BDI-II) (7), the Self-Rating Depression Scale (SDS) (5), the Patient Health Questionnaire-9 (PHQ-9) (8), and the Hamilton Rating Scale for Depression (HAMD) (9). These instruments are advantageous due to their simplicity, accessibility, and cost-effectiveness, which facilitate widespread implementation (10). However, these scales demand a certain level of literacy and language comprehension, rendering them ineffective for individuals with cognitive impairments (5). 
Moreover, the reliance on self-reporting may yield unreliable outcomes, particularly when subjects conceal their symptoms due to stigma or social desirability biases, thus elevating the risk of diagnostic inaccuracies (11,12). Consequently, there is a pressing need for an efficient, accurate, and direct method to screen for depressive disorder. Emerging research in machine learning and deep learning within affective computing, particularly in the realm of physiological-psychological computing, offers promising avenues for enhancing depressive disorder screening methodologies (13-16). Machine learning analysis of electroencephalography (EEG) signals has successfully differentiated depression, yet neuroimaging-based depression screening has been difficult to apply on a large scale because of its invasiveness and high cost (17,18). In addition, identifying potentially depressed individuals through social media analytics, while innovative, does not effectively reach less web-active populations, such as adolescents and older adults, who are more susceptible to the disease (19,20). Although modalities such as magnetic resonance imaging (MRI), electrocardiography (ECG), EEG, and galvanic skin response (GSR) offer precise diagnostic capabilities, their utility in the widespread screening of depressive disorders is significantly constrained: they require substantial medical equipment and must be conducted in a hospital, which limits their deployment, particularly in regions with restricted economic resources. In contrast, audio and video information can be collected through interviews with far fewer requirements, which makes large-scale data collection for screening more feasible. Overall, depression identification based on noninvasive modalities is a potentially viable screening method for depressive disorders that requires more in-depth research (21).

Depression recognition has received considerable attention from computer science researchers in recent years, but comparatively little from researchers with backgrounds in psychiatry and psychology, which has made it difficult for this technology to move into clinical application.

Rationale and knowledge gap

To date, no study has focused on deep learning-based noninvasive multimodal screening for depressive disorders from a clinical application perspective (22). Therefore, this review seeks to concisely orient psychiatry and psychology researchers to the overall progress and key technical points of this technology in screening for depressive disorders. This study will contribute to the intersection of artificial intelligence in psychiatry and psychology.

Objective

This article aims to provide more knowledge about deep learning and audio-video and text-based depression screening methods. We present this article in accordance with the Narrative Review reporting checklist (available at https://amj.amegroups.com/article/view/10.21037/amj-25-6/rc).

Methods

In this review, we included deep learning-based studies on multimodal depressive disorder screening using audio, video, and text (Table 1). In line with the principles of narrative reviews (23), we conducted bibliographic searches up to October 2024 in Web of Science, Scopus, PubMed, IEEE Xplore, Google Scholar, and Embase. To find studies that matched the requirements, we designed the following query string: “(Depression*) AND ((Audio) OR (Voice) OR (Speech) OR (Video) OR (Expression) OR (Image) OR (Text) OR (Semantics)) AND ((Deep Learning) OR (Neural Networks) OR (CNN) OR (DNN) OR (LSTM) OR (Bi-LSTM) OR (Transformer) OR (BERT)) AND ((Multimodality) OR (Bimodality) OR (Trimodality))”. The string was constructed with logical operators common to the databases so that the relationships between the search terms were applied consistently. The titles and abstracts of all articles were queried using the filters of each database.

Table 1

The search strategy summary

Items	Specification
Date of search	October 30, 2024
Databases and other sources searched	Web of Science, Scopus, PubMed, IEEE Xplore, Google Scholar, and Embase
Search terms used	“(Depression*) AND ((Audio) OR (Voice) OR (Speech) OR (Video) OR (Expression) OR (Image) OR (Text) OR (Semantics)) AND ((Deep Learning) OR (Neural Networks) OR (CNN) OR (DNN) OR (LSTM) OR (Bi-LSTM) OR (Transformer) OR (BERT)) AND ((Multimodality) OR (Bimodality) OR (Trimodality))”
Timeframe	2010–2024
Inclusion and exclusion criteria	Inclusion criteria: articles written in English
Inclusion and exclusion criteria	Exclusion criteria: articles using only one modality, or modalities other than speech, video, and text
Selection process	Z.D. selected the papers

BERT, Bidirectional Encoder Representations from Transformers; Bi-LSTM, bidirectional long short-term memory; CNN, convolutional neural network; DNN, deep neural network; LSTM, long short-term memory.

After a first screening to remove duplicates and irrelevant studies, we confirmed that all selected papers were published between 2012 and 2024 and focused on depressive disorder screening with deep learning methods [e.g., multilayer perceptrons (MLPs), deep neural networks (DNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs)]. We then retained only studies whose methods were based on the video, audio, or text modalities and removed the others.

Of the 1,615 studies identified by the search and supplemented from reference lists, 26 articles met the inclusion criteria (see Figure 1). Some of the included papers reported several independent methods; each such method was counted as a separate study.

Figure 1 Research search process.

Results

In this section, we summarize, through a detailed analysis, the technical steps commonly used for multimodal depressive disorder screening based on audio, video, and text: (I) raw data input; (II) preprocessing, segmentation, and imbalance processing; (III) extraction of feature values related to depressive disorders; (IV) training of the proposed neural network; (V) multimodality fusion mechanism; (VI) outputting classifications or scores; and (VII) validation on the validation set or test set. These basic research steps are shown in Figure 2. See Table 2 for details of the identified studies, including datasets, network architectures, input modalities, and fusion methods.

Figure 2 The fundamental research procedure of the method.

Table 2

Summary of main information of included studies

Study	Key contributions	Dataset	Deep learning	Input modalities	Fusion methods	Regression/classification
Yang et al. 2017 (24)	An audiovisual multimodal depressive disorder screening framework consisting of SVM, DCNN, and DNN models	DAIC-WOZ	SVM + DCNN-DNN	A + V + T	EF/LD	Regression
Haque et al. 2018 (25)	Two technical parts: (I) sentence-level “summary” embedding; (II) causal CNN	DAIC-WOZ	C-CNN/LSTM	A + V + T	EF	Both
Al Hanai et al. 2018 (26)	A LSTM neural network model without explicit topic modeling of content	DAIC-WOZ	LSTM	A + T	LD	Both
Yang et al. 2018 (27)	New text and video features, for the first time by hybrid deep and shallow models from audio, video and text descriptors	DAIC-WOZ	SVM + DCNN-DNN	A + V + T	LD	Both
Qureshi et al. 2019 (28)	A new multi-task learning attention-based deep neural network model	DAIC-WOZ	MT-DLC/R-Comb-Att	Multimodal	EF	Both
Rohanian et al. 2019 (29)	A deep learning model that merges multiple types of data word-by-word using a time-based method. It has special controls to adjust the influence of sound and visual data to improve accuracy	DAIC-WOZ	LSTM/LSTM + gating	A + V + T/A + T	LD	Both
Lam et al. 2019 (30)	Combining a data augmentation process based on topic modeling, using Transformer and 1D convolutional neural network for acoustic feature modeling	DAIC-WOZ	Transformer + 1D-CNN	A + T	LD	Classification
Ray et al. 2019 (31)	A novel multi-level attention network for multimodal depressive disorder prediction while learning intra-modal and inter-modal correlations	E-DAIC	Bi-LSTM + Mul-ATT-net	A + T/A + V + T/V + T	EF	Regression
Makiuchi et al. 2019 (32)	Proposed a multimodal fusion method for speech and language representations for depressive disorder screening, based on GCNN and LSTM layers	E-DAIC	GCNN-LSTM + 7 CNN blocks-LSTM + GCNN	A + T/A + T/A + V + T	EF	Regression
Niu et al. 2020 (33)	A novel STA network and MAFF strategy to obtain multimodal representations of depressive disorder cues	AVEC2013/2014	STAnet (1/2/3D- CNN + LSTM) + SVR	A + V	EF	Regression
Lin et al. 2020 (34)	Proposed a new method for screening depressive disorder using a network with layers for language, speech signals, and a connected ensemble	DAIC-WOZ	Bi-LSTM + 1D-CNN	A + T	LD	Classification
Zhang et al. 2020 (35)	Proposed a multi-task DNN based on multiDDAE and text PVs	E-DAIC	DNN	A + V/A + T + V	EF	Both
Xiao et al. 2021 (36)	Proposed an attention-C-CNN based audio sequence modeling and Bert-based text sequence modeling model	DAIC-WOZ	BERT + Att-C-CNN	A + T	LD	Classification
Solieman and Pustozerov 2021 (37)	Proposed a neural network model to screen depressive disorder based on audio information and speech quality analysis in breath dimension	DAIC-WOZ	LSTM + CNN	A + T	LD	Classification
Qureshi et al. 2021 (38)	Proposed a neural network model based on LSTM and adversarial learning of gender information to screen depressive disorder	DAIC-WOZ	Gen-concat (LSTM + Multi-Task)	A + V + T	HF	Regression
Sun et al. 2021 (39)	A Transformer-based multimodal adaptive fusion transformer network for estimating depressive disorder level, combined with multi-task learning	E-DAIC	Transformer + Multi-Task Learning	A + T	LD/HF	Regression
Sun et al. 2022 (40)	A transformer-based autoencoder for screening depressive disorder from multiple data types, and a network that fuses deep features across modalities to analyze audiovisual data at the sentence level	DAIC-WOZ	DDFN	A + V + T	LD	Both
Guo et al. 2022 (41)	TOAT model, a topic attention module is designed to learn the importance of each topic, and a dual-branch architecture with a late fusion strategy is also designed to build the TOAT model	DAIC-WOZ	TOAT	A + T	EF	Classification
Chen et al. 2022 (42)	A GNN architecture that handles both shared and specific features across modalities, uses reconstruction networks for single modality accuracy, and applies attention mechanisms for better fusion	DAIC-WOZ	MS2-GNN	A + V + T	HF	Classification
Rasipuram et al. 2022 (43)	Focused on methods for screening depressive disorder from multimodal signals using task-specific representations (using task-oriented embeddings generated via a GPT2 mediated language model) and late fusion	DAIC-WOZ	Bi-LSTM + LSTM + NN	A + T/A + V + T/V + T	LD	Regression
Fang et al. 2023 (44)	Systematically analyzed audio-visual and textual data related to depressive disorder and proposed a multimodal fusion model (MFM-Att) with multi-level attention mechanism for depressive disorder screening	DAIC-WOZ/E-DAIC	MFM-Att	A + V/A + V + T/V + T	EF	Regression
Saggu et al. 2022 (45)	Proposed a three-stage framework multimodal deep learning approach called DepressNet based on a Bi-LSTM layer network	E-DAIC	Bi-LSTM + Depress-Net	A + V + T	EF	Regression
Tasnim et al. 2023 (46)	Proposed an NLP-based multimodal system that integrates deep-learned features from audio and text, as well as hand-crafted features informed by clinically validated domain knowledge	DEPAC corpus	Ro-BERT + Bi-LSTM	A + T	EF	Classification
Aloshban et al. 2022 (47)	Proposed a deep learning based depressive disorder screening network designed for sequence modeling (Bi-LSTM) and multimodal analysis methods (late fusion, joint representation, and gated multimodal units)	DEPAC corpus	BLSTM	A + T	LD	Classification
Uddin 2022 (48)	Proposed a dynamic feature descriptor based on VLDSP, TAP and MFB strategies for a deep multimodal depression recognition framework	AVEC2013/2014	Spatio-Temporal Network	A + V	EF	Regression
Huang 2022 (49)	Designed a novel autoencoder framework based on ResNet50, BERT, and memory fusion networks to screen depressive disorder	Self-built dataset	MFN + MFM + BERT-ResNet50	V + T	LD	Regression

Datasets: DAIC-WOZ, the Distress Analysis Interview Corpus-Wizard of Oz; E-DAIC, the Extended Wizard of Oz dataset; self-built, an independent dataset built by the study's authors; DEPAC, the multimodal mood disorder dataset. Modality: A, audio; V, video; T, text; A + V, audio and video bimodal; A + T, audio and text bimodal; V + T, video and text bimodal; A + V + T, video-audio and text trimodal. Multimodality fusion methods: EF, early feature-level fusion; HF, network hybrid fusion; LD, late decision-level fusion. Outcome metrics: regression, studies predicting depressive disorder severity scores; classification, studies predicting the presence or absence of major depressive disorder. BERT, Bidirectional Encoder Representations from Transformers; Bi-LSTM, bidirectional long short-term memory; CNN, convolutional neural network; DCNN, deep convolutional neural network; DDFN, Deep Decoupling Fusion Network; DNN, deep neural network; GCNN, gated convolutional neural network; GNN, graph neural network; LSTM, long short-term memory; MAFF, multimodal attention feature fusion; MFB, Multimodal Factorized Bilinear Pooling; multiDDAE, multimodal deep denoising autoencoder; NLP, natural language processing; PVs, paragraph vectors; STA, spatiotemporal attention; SVM, support vector machine; SVR, support vector regression; TAP, Temporal Attention Pooling; TOAT, target-oriented attention; VLDSP, Volumetric Local Directional Structure Pattern.

Notably, unlike general psychiatry studies, all included studies were based on publicly available datasets, except for one paper with four studies whose data came from a self-constructed dataset. Thirty-two studies from 16 papers were based on the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ), the designated dataset for the continuous Audio/Visual Emotion and depression recognition Challenge (AVEC 2016/2017) (50). Thirteen studies from 6 papers were based on the Extended Wizard of Oz dataset (E-DAIC), an extended version of the DAIC-WOZ that was collected through semi-clinical interviews designed to support the diagnosis of psychologically distressing conditions such as depressive disorder and was used in AVEC2018 (51). Two studies from two papers were based on DEPAC, a large audio dataset for mental distress analysis presented in 2022. Five studies from two papers were derived from AVEC2013/2014 (52,53). Because they draw on a small number of datasets and lack validation with external data, these studies remain at the laboratory stage and fall short in terms of external generalizability.

Deep learning methods

Deep learning, a subfield of machine learning, represents a recent stage in the development of artificial intelligence and has boosted progress in various application areas (54). Unlike traditional machine learning, deep learning can automatically learn features from large amounts of data without explicit programming or presupposed relationships (55). We found that the deep learning methods used in the included studies changed markedly over time, as shown in Figure 3.

Figure 3 Dominant deep learning methods over time. ATT, attention; BERT, Bidirectional Encoder Representations from Transformers; Bi-LSTM, bidirectional long short-term memory; C-CNN, causal convolutional neural network; CNN, convolutional neural network; DCNN, deep convolutional neural network; DNN, deep neural network; LSTM, long short-term memory; MLP, multilayer perceptron; SVM, support vector machine; SVR, support vector regression.

In the early stage, three studies utilizing a hybrid approach of machine learning and deep learning for multimodal depressive disorder screening were the first to emerge, accounting for approximately 12% of the studies, with an accuracy of up to 0.937 in determining the presence of major depressive disorder (24,27). In the second period, algorithms that perform well in unimodal tasks [e.g., CNNs in video and long short-term memory (LSTM) in audio] began to appear in multimodal depressive disorder screening, with 17 such studies included, accounting for 65%, and achieving a minimum RMSE of 4.28 and an accuracy of 0.938 (31,40). In the third stage, pre-trained algorithms that have achieved some success in predicting time series data are applied. Six such studies were included in this study, accounting for 23% of the total, with a best RMSE of 3.78 and an accuracy of 0.925 (40,43). During this period, multi-task learning and attention mechanisms were also adopted into the research as a compatible and innovative approach (28,38,39).

The advancement of deep learning algorithms has improved the accuracy of depressive disorder screening, and the current rapid development of large language models (e.g., GPT) is expected to further boost the field. However, algorithms are not the only factor; data, fusion methods, and features all have a significant impact on the effectiveness of the methods.

Features of modalities

Depressive disorder screening requires representative features to be extracted from the raw modality information and fed into a model for training. The choice of features affects the outcome: extracting too many features introduces redundancy, whereas incomplete feature extraction degrades model performance (56). This section describes in detail the features that can be extracted from the three modalities.

The features of audio or speech mainly comprise acoustic properties such as pitch, tone, amplitude, and frequency, which carry the differences that reveal depressive cues (57). The audio features extracted in the included studies fall into three main types: low-level descriptors (LLDs), higher order statistics (HOSs), and function generating features (FGFs). LLDs are basic descriptive quantities, such as fundamental frequency (pitch) features, formant (resonance peak) features, energy features, and short-term average amplitude features, that can be computed directly from audio analysis without further processing (24,27,35,44). HOSs are statistics, including the mean, maximum, minimum, median, standard deviation, skewness, and kurtosis, obtained after statistical processing of the LLDs using toolkits such as COVAREP. HOSs are used especially for private data, because they protect the original data from being disclosed (28,29,37,43,45). The third type consists of audio features extracted directly with the MFCC, eGeMAPS, and AUpose functions, which improves the representation of the modality and avoids the side effects of dealing with very long sequences (29,30,58,59).
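The step from low-level descriptors to higher order statistics can be illustrated with a minimal sketch. It assumes that each descriptor is available as a list of per-frame values; the function names `functionals` and `summarize_llds` and the input values are hypothetical, and the frame-level extraction performed by toolkits such as COVAREP is not reproduced here.

```python
import math
import statistics


def functionals(values):
    """Summarize one LLD contour (a list of per-frame values) with seven statistics."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divides by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std > 0:
        skew = sum(((v - mean) / std) ** 3 for v in values) / n
        kurt = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0  # excess kurtosis
    else:
        skew = kurt = 0.0  # constant contour: set to 0 by convention in this sketch
    return {
        "mean": mean,
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "std": std,
        "skewness": skew,
        "kurtosis": kurt,
    }


def summarize_llds(frames):
    """Turn a list of frames (each a list of LLD values) into one fixed-length vector."""
    vector = []
    for column in zip(*frames):  # iterate over descriptors rather than frames
        vector.extend(functionals(list(column)).values())
    return vector


# Hypothetical input: 5 frames x 2 descriptors (for example, pitch in Hz and energy).
frames = [[110.0, 0.2], [120.0, 0.4], [130.0, 0.6], [140.0, 0.8], [150.0, 1.0]]
vector = summarize_llds(frames)
print(len(vector))  # 14 values: 7 statistics for each of the 2 descriptors
```

Whatever the length of the recording, the output has a fixed size, and the frame-by-frame contour cannot be reconstructed from it, which is consistent with the privacy motivation described above.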

In terms of facial expression or video representation, depressed individuals are usually characterized by sad expressions, frowning, drooping corners of the mouth, and a dull, downcast gaze (60). To measure the displacement and velocity of facial landmarks, some studies have proposed displacement range histograms (24,35). In addition, Facial Action Coding System action units and head pose have been used to assess the severity of depressive disorder. To protect patient privacy, the DAIC-WOZ dataset provides head position estimates, head rotation, 68 facial landmark locations, gaze tracking, and facial action units derived directly from the video data. Dynamic feature descriptors based on the Volumetric Local Directional Structure Pattern (VLDSP), and 2D residual networks used to extract facial dynamics, are also commonly used video features.
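A simplified sketch of a landmark displacement histogram follows. It is not the exact descriptor used in the cited studies (24,35): the bin edges, the function names, and the two-landmark input are assumptions made for illustration, and real data would contain 68 landmarks per frame.

```python
import math


def displacements(prev_frame, next_frame):
    """Euclidean displacement of each landmark between two consecutive frames."""
    return [math.dist(p, q) for p, q in zip(prev_frame, next_frame)]


def displacement_histogram(frames, bin_edges):
    """Normalized histogram of landmark displacements over a clip.

    A displacement d is counted in bin i when bin_edges[i] <= d < bin_edges[i + 1];
    values outside all bins are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    total = 0
    for prev_frame, next_frame in zip(frames, frames[1:]):
        for d in displacements(prev_frame, next_frame):
            for i in range(len(counts)):
                if bin_edges[i] <= d < bin_edges[i + 1]:
                    counts[i] += 1
                    total += 1
                    break
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical clip: 3 frames, 2 landmarks given as (x, y) pixel coordinates.
clip = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(3.0, 4.0), (10.0, 10.0)],
    [(3.0, 4.0), (10.0, 11.0)],
]
edges = [0.0, 0.5, 2.0, 10.0]  # assumed bin edges in pixels
print(displacement_histogram(clip, edges))
```

Dividing each displacement by the time between frames would give a velocity instead, so the same histogram can summarize how fast, as well as how far, the face moves.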

For the textual or semantic modality, word-level and phoneme-level embeddings have long been the preferred features for encoding text and speech, and have had a significant impact. Haque et al. proposed a novel multimodal sentence-level embedding (25). Al Hanai et al. used Doc2Vec from the Python Gensim library to generate embeddings of each individual response to every query, as well as an embedding of the query itself (26). Yang et al. applied a paragraph vector (PV) model to multimodal deep learning depressive disorder screening (24). Lin et al. and Sun et al. applied ELMo in their models, representing each sentence as the average of all three layers of ELMo embeddings (34,40). Zhang et al. first obtained transcripts of audio recordings (psychotherapy interviews) through APIs such as Google Cloud Platform (GCP) and then conducted experiments with the two Doc2Vec architectures, namely PV with distributed memory (PV-DM) and PV with distributed bag of words (PV-DBOW) (35).
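The layer-averaging idea can be sketched with plain lists standing in for embeddings. The sketch assumes that the three layers are already available as one vector per token and that tokens are then mean-pooled into a sentence vector; the pooling step and the function names are assumptions, and no embedding library is called.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def sentence_embedding(layers):
    """layers: three lists (one per layer), each holding one vector per token."""
    # Average the three layer vectors for each token, then average over tokens.
    per_token = [mean_vectors(list(token_layers)) for token_layers in zip(*layers)]
    return mean_vectors(per_token)


# Hypothetical 2-token sentence with 2-dimensional embeddings in each layer.
layer0 = [[1.0, 2.0], [3.0, 4.0]]
layer1 = [[2.0, 3.0], [4.0, 5.0]]
layer2 = [[3.0, 4.0], [5.0, 6.0]]
print(sentence_embedding([layer0, layer1, layer2]))  # [3.0, 4.0]
```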

Extracting different features from modality and training based on this is a task worth exploring in depth (61,62). Although deep learning can automatically find associated information from a large amount of data, too much useless information fed into the algorithm will obviously increase the amount of computation, which will increase the cost and take more time at the same time (63). Thus selecting features associated with depressive disorder is an effective means of reducing burden and achieving low-cost goals, but it does require theoretical knowledge of psychiatry.

Multi-modality fusion approaches

Depressive disorders can be diagnosed through multiple dimensions, and multimodality-based methods may perform better than unimodality-based methods (64). The key to combining multiple modalities is how the different modalities are fused (65). The three main fusion methods in the included studies are early feature-level fusion, late decision-level fusion, and network hybrid fusion.

Early feature-level fusion is the most commonly used strategy in multimodality fusion: features extracted from different modalities are concatenated into a single high-dimensional feature vector immediately after feature extraction (32,33). Of the included studies, 23 used early feature-level fusion to combine multimodal information (25,31,38,66,67). In decision-level fusion, a result is first obtained from each modality, and combination rules (mean, maximum, minimum, etc.) are then applied to merge the per-modality results; 23 studies used decision-level fusion (30,31,35,45). Network hybrid fusion is a newer approach that sits between feature-level and decision-level fusion: it is neither a brute-force concatenation of features nor a simple statistic over per-modality results, but an attempt to fuse information from different modalities through specialized fusion networks (68). Among the included studies, decision-level fusion achieved the best accuracy (0.938) and the best RMSE for depressive disorder score prediction (3.783). These results demonstrate the effectiveness of decision-level fusion, which is easier to justify than directly concatenating features without theoretical support. In the future, hybrid fusion networks grounded in the latest psychiatric theories of depression offer an interpretable fusion method with great potential.
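The difference between the first two strategies can be shown with a minimal sketch. The feature values and per-modality scores below are hypothetical, and the model that would produce the scores is omitted; hybrid fusion is not shown because it depends on a learned fusion network.

```python
def early_fusion(*modal_features):
    """Feature-level fusion: concatenate per-modality feature vectors into one vector."""
    fused = []
    for features in modal_features:
        fused.extend(features)
    return fused


def late_fusion(modal_scores, rule="mean"):
    """Decision-level fusion: merge per-modality outputs with a simple combination rule."""
    rules = {
        "mean": lambda scores: sum(scores) / len(scores),
        "max": max,
        "min": min,
    }
    return rules[rule](modal_scores)


# Hypothetical feature vectors for audio, video, and text.
audio, video, text = [0.1, 0.7], [0.3], [0.9, 0.2, 0.4]
print(early_fusion(audio, video, text))  # one 6-dimensional input for a single model

# Hypothetical per-modality probabilities of depression from three separate models.
scores = [0.8, 0.6, 0.4]
for rule in ("mean", "max", "min"):
    print(rule, late_fusion(scores, rule))
```

In early fusion a single model sees all modalities at once, whereas in late fusion each modality has its own model and only the outputs are combined.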

Classification of major depressive disorder

The Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset is often recognized as an authoritative resource within the research community, particularly for its rigorous data collection, multimodal data comprising video, audio, and text, and its clinical validation using standardized measures like the PHQ-8. Developed by the University of Southern California’s Institute for Creative Technologies and utilized in various studies and competitions, it is governed by strict ethical guidelines ensuring its reliability and ethical integrity for research into mental health diagnostics.

The classification of the presence or absence of major depressive disorder is similar to diagnostic tests in psychiatry (69), whose effectiveness can be assessed through accuracy, sensitivity (also known as recall or the true positive rate), and the F1 score (the harmonic mean of precision and recall) (70,71). All outcome indicators for the included studies are shown in Table 3.
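As a minimal illustration of these indicators, the sketch below computes them from binary labels, where 1 denotes depressed and 0 denotes not depressed. The labels and predictions are hypothetical, and the function name `classification_metrics` is introduced only for this example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall (sensitivity), precision, and F1 for binary labels (1 = depressed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical screening result for 8 participants, 3 of whom are depressed.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because depressed participants are usually the minority class, accuracy alone can look high even when many cases are missed, which is why recall and F1 are reported alongside it.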

Table 3

Results of classification studies on DAIC-WOZ

Study	Method	Modality	Fusion	Accuracy	Recall	F1 score
Haque et al. 2018 (25)	C-CNN	A + V + T	Feature fusion	0.828	0.833	0.743
Al Hanai et al. 2018 (26)	LSTM	A + T	Decision fusion	0.848	0.830	0.770
Sun et al. 2021 (39)	Trf + CNN + Full	A + T	Decision fusion	0.777	0.750	0.670
	Trf + CNN + Topic	A + T	Decision fusion	0.800	0.750	0.690
	Trf + CNN + Augm	A + T	Decision fusion	0.925^†	0.830	0.870
Rasipuram et al. 2022 (43)	LSTM	A + V + T	Feature fusion	0.790	0.715	0.670
	LSTM	A + V + T	Decision fusion	0.853	0.635	0.700
	LSTM + Gating	A + T	Feature fusion	0.878	0.821	0.800
	LSTM + Gating	A + V + T	Feature fusion	0.885	0.820	0.810
Yang et al. 2017 (24)	SVM + DCNN-DNN	A + V + T	Decision fusion	0.937^†	0.868	0.892^†
Lin et al. 2020 (34)	Bi-LSTM + 1D-CNN	A + T	Decision fusion	–	0.920^†	0.850
Xiao et al. 2021 (36)	BERT + A-C-CNN	A + T	Decision fusion	0.782	0.800	0.711
Solieman and Pustozerov 2021 (37)	LSTM + CNN	A + T	Decision fusion	0.867	0.850	0.790
Chen et al. 2022 (42)	MS2-GNN	A + V + T	Hybrid fusion	0.891	0.857	0.828
Sun et al. 2022 (40)	DDFN	A + V + T	Decision fusion	0.938^†	0.880	0.890^†

Modality: A, audio; V, video; T, text; A + T, audio and text bimodal; A + V + T, video-audio and text trimodal. Accuracy: the proportion of all cases that the test classifies correctly. Recall: the proportion of people with the condition whom the test correctly identifies. F1 score: the harmonic mean of precision and recall, which is important in medical tests. ^†, the optimal value in its column. Augm, augmented; BERT, Bidirectional Encoder Representations from Transformers; Bi-LSTM, bidirectional long short-term memory; CNN, convolutional neural network; DAIC-WOZ, the Distress Analysis Interview Corpus-Wizard of Oz; DCNN, deep convolutional neural network; DDFN, Deep Decoupling Fusion Network; DNN, deep neural network; GNN, graph neural network; LSTM, long short-term memory; SVM, support vector machine; Trf, Transformer.

Haque et al. proposed a C-CNN with tri-modal feature fusion that achieved an accuracy of 0.828 on DAIC-WOZ (25). An LSTM with decision fusion of the audio and text modalities achieved an accuracy of 0.848, outperforming the former (26). A combined LSTM and CNN algorithm based on audio and text bimodality further improved the accuracy to 0.867 (37). An audio and text bimodal fusion method based on a 1D-CNN and a bidirectional long short-term memory (Bi-LSTM) network, optimized from previous research, achieved a sensitivity of 0.92, the highest in the included literature (34). Combining the pre-trained Transformer model with a 1D-CNN raised the accuracy of audio and text bimodal decision fusion to 0.925 (39). Among the tri-modal fusion studies, an algorithm combining machine learning and deep learning achieved an accuracy of 0.937 (24).

Prediction of depression severity

Evaluating the severity score of depressive disorders is a regression task. The root mean square error (RMSE) and mean absolute error (MAE) are the most frequently used performance indicators in the included studies (72). The results of the included studies are shown in Tables 4-6.

Table 4

Results of regression studies on the DAIC-WOZ test-set

Study	Method	Modality	Fusion method	RMSE	MAE
Yang et al. 2017 (24)	DCNN-DNN	A + V + T	Feature fusion	5.51 (Dev)	4.74 (Dev)
Yang et al. 2017 (24)	DCNN-DNN	A + V + T	Decision fusion	6.349	5.386
Haque et al. 2018 (25)	C-CNN	A + V + T	Feature fusion	–	3.67
Al Hanai et al. 2018 (26)	LSTM	A + T	Decision fusion	6.27	4.97
Qureshi et al. 2019 (28)	MT-DLC-CombAtt	A + V + T	Feature fusion	4.24^†	3.29^†
Rohanian et al. 2019 (29)	LSTM	A + V + T	Feature fusion	6.68	5.29
	LSTM	A + V + T	Decision fusion	5.86	3.92
	LSTM + Gating	A + T	Feature fusion	5.14	3.66
	LSTM + Gating	A + V + T	Feature fusion	4.99^†	3.61^†
Qureshi et al. 2021 (38)	Gen-concat (LSTM + Multi-Task)	A + V + T	Hybrid fusion	4.48	3.50
	GenASP (LSTM + Multi-Task)	A + V + T	Hybrid fusion	4.72	3.49
	Bi-LSTM + LSTM	A + T	Decision fusion	3.92	3.15
Rasipuram et al. 2022 (43)	NN + LSTM	V + T	Decision fusion	4.12	3.12
Rasipuram et al. 2022 (43)	Bi-LSTM + LSTM + NN	A + V + T	Decision fusion	3.78^†	3.12^†
Fang et al. 2023 (44)	MFM-Att	A + V + T	Feature fusion	3.68	3.18
		A + V	Feature fusion	5.20	4.12
		A + T	Feature fusion	4.00	3.39
		V + T	Feature fusion	3.18^†	3.36^†
Sun et al. 2022 (40)	DDFN	A + V + T	Decision fusion	5.35	3.78

RMSE and MAE quantify the difference between the model's predicted values and the actual observed values; lower values indicate better performance. Modality: A, audio; V, video; T, text; A + T, audio and text bimodal; A + V + T, video-audio and text trimodal. ^†, the optimal value in its column. Bi-LSTM, bidirectional long short-term memory; C-CNN, cascaded convolutional neural network; DAIC-WOZ, the Distress Analysis Interview Corpus-Wizard of Oz; DCNN, deep convolutional neural network; DDFN, Deep Decoupling Fusion Network; Dev, development set; DNN, deep neural network; LSTM, long short-term memory; MAE, mean absolute error; MFM-Att, Multimodal Fusion Model with Attention; MT-DLC, Multi-Task Deep Learning Classifier; NN, neural network; RMSE, root mean square error.
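The two regression indicators reported in Table 4 can be sketched in a few lines of Python. This is an illustration only: the function names are our own, and the severity scores in the usage example are hypothetical rather than drawn from DAIC-WOZ.

```python
import math


def rmse(observed, predicted):
    """Root mean square error: square root of the mean squared difference."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical severity scores for four participants.
observed_scores = [2, 10, 15, 7]
predicted_scores = [4, 9, 11, 7]

error_rmse = rmse(observed_scores, predicted_scores)
error_mae = mae(observed_scores, predicted_scores)  # 1.75
```

Because RMSE squares each error before averaging, it weights large individual errors more heavily and is never smaller than MAE on the same predictions. An MAE larger than the RMSE for the same experiment therefore signals a reporting or transcription error.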

Table 5

Results of regression studies on the E-DAIC dataset

Study	Method	Modality	Fusion	Dev. set RMSE	Dev. set CCC	Test set RMSE	Test set CCC
Makiuchi et al. 2019 (32)	GCNN-LSTM (A), 7 CNN blocks-LSTM (T) and GCNN (V)	A + T	Feature fusion	5.08	0.452	6.42	0.213
		A + T	Feature fusion	3.86^†	0.696	6.11	0.403
		A + V + T	Feature fusion	4.86	0.624	–	–
Ray et al. 2019 (31)	Bi-LSTM + Multi-Att	V + T	Feature fusion	4.67	–	4.67	–
		A + T	Feature fusion	4.37	–	4.37	–
		A + V + T	Feature fusion	4.28	–	4.28^†	–
Zhang et al. 2020 (35)	DNN	A + V	Feature fusion	–	–	5.07	0.386
Zhang et al. 2020 (35)	DNN	A + V + T	Feature fusion	–	–	4.48	0.528^†
Sun et al. 2021 (58)	Transformer + Multi-Task	A + T	Decision fusion	3.783^†	0.733^†	–	–
		A + T	Decision fusion	3.852	0.722	–	–
		A + T	Hybrid fusion	3.782^†	0.682	–	–
Saggu et al. 2022 (45)	Bi-LSTM + DepressNet	A + V + T	Feature fusion	4.32	0.662	5.36	0.457
Fang et al. 2023 (44)	MFM-Att	A + V + T	Feature fusion	–	–	5.17	–

RMSE quantifies the difference between the model's predicted values and the actual observed values (lower is better); CCC quantifies the agreement between them (higher is better). Modality: A, audio; V, video; T, text; A + T, audio and text bimodal; A + V + T, video-audio and text trimodal. ^†, the optimal value in its column. Bi-LSTM, bidirectional long short-term memory; CCC, concordance correlation coefficient; CNN, convolutional neural network; Dev., development; DNN, deep neural network; E-DAIC, the Extended Distress Analysis Interview Corpus; GCNN, gated convolutional neural network; LSTM, long short-term memory; MFM-Att, Multimodal Fusion Model with Attention; RMSE, root mean square error.
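For readers less familiar with the concordance correlation coefficient reported in Table 5, the sketch below implements its standard definition: twice the covariance divided by the sum of the two variances and the squared difference of the means. The function name and example values are illustrative assumptions, and population (1/n) moments are used.

```python
def concordance_cc(observed, predicted):
    """Concordance correlation coefficient between observed and predicted scores."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    # The denominator is zero only when both series are constant and equal.
    return 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


# Perfect agreement versus a constant over-estimate of one point (hypothetical values).
perfect = concordance_cc([1, 2, 3], [1, 2, 3])
shifted = concordance_cc([1, 2, 3], [2, 3, 4])
```

The shifted predictions are perfectly correlated with the observations, yet their CCC falls well below 1 because the coefficient also penalizes systematic bias. This property makes it a stricter measure than Pearson correlation for severity prediction.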

Table 6

Results of regression studies on AVEC 2013/14 (A + V)

Study	Dataset	Method	Fusion	Development RMSE	Development MAE	Test RMSE	Test MAE
Niu et al. 2020 (33)	AVEC2013	3D-CNN + LSTM + SVR	Feature fusion	8.1	6.38	8.16	6.14
	AVEC2014	1D-CNN + LSTM + SVR		6.68	5.07	7.03	5.21
	AVEC2014	2D-CNN + LSTM + SVR		8.00	6.42	7.20	5.61
Uddin 2022 (48)	AVEC2013	Spatio-temporal Network		–	–	6.83	5.38
Uddin 2022 (48)	AVEC2014	Spatio-temporal Network		–	–	6.16	5.03

RMSE and MAE quantify the difference between the model's predicted values and the actual observed values; lower values indicate better performance. 1D, one-dimensional; 2D, two-dimensional; 3D, three-dimensional; A, audio; AVEC, Audio/Visual Emotion and depression recognition Challenge; CNN, convolutional neural network; LSTM, long short-term memory; MAE, mean absolute error; RMSE, root mean square error; SVR, Support Vector Regression; V, video.

Qureshi et al. proposed an MT-CombAtt network architecture that combines multi-task learning, an attention fusion mechanism and a deep neural network (DNN), achieving better-than-baseline RMSE (4.24) and MAE (3.29) (28). Rohanian et al. proposed a word-level multimodal fusion network with gating that performs well on the regression task (29). Of the several experiments they conducted separately, feature-level fusion of all three modalities achieved the best RMSE (4.99) and MAE (3.61).

Qureshi et al. proposed an adversarial shared-private multi-task network to predict the level of depressive disorder in men and women separately (38). The model accounted for the effect of gender on the accuracy of depressive disorder screening, achieving an MAE of 3.49, better than that of the baseline. Another approach to depressive disorder screening generates task-oriented textual embeddings and then fuses the modalities (43). The model with the Bi-LSTM + LSTM + NN architecture achieved the best RMSE (3.78) and MAE (3.12) among its experiments, as listed in Table 4. The multimodal fusion model with a multi-level attention mechanism (MFM-Att) achieved an RMSE of 3.68 and an MAE of 3.18 with all three modalities (44).

Discussion

The purpose of this review is to validate the effectiveness of deep learning algorithms based on video-audio-text in depressive disorder screening, to analyze the technical points and difficulties of this method, and to facilitate the application of this technique in future clinical depressive disorder screening. The findings of this review substantiate the efficacy of a multimodal diagnostic approach—encompassing video, audio, and text—in the screening of depressive disorder. The technique exhibits an accuracy exceeding 0.77 in discerning the presence or absence of major depressive disorder. Furthermore, in regression analyses aimed at predicting the severity of depressive disorder, the RMSE and MAE values reported in the studies are notably low. This methodology aligns seamlessly with contemporary psychiatric paradigms that prioritize early intervention and preventive care over the treatment of severe conditions. By leveraging readily accessible modalities, this approach is optimally positioned for early screening during the initial phases of a depressive episode, supporting a proactive shift in mental health treatment strategies (59).

It is important to highlight that traditional psychiatric studies focusing on diagnostic methods often emphasized evaluating the consistency of results across various samples (73). However, a significant challenge in contemporary research concerning the screening and severity assessment of depressive disorder is the reliance on a limited number of datasets for training and validating studies. This creates a bottleneck due to the absence of multicentered data, which is crucial for enhancing the generalizability and robustness of the findings. Furthermore, the scarcity of external datasets for validation purposes remains a pressing issue (74). Despite some studies achieving high performance metrics, this limitation in data availability directly impedes the clinical applicability and reliability of these diagnostic techniques.

Although these methods achieved good accuracy or low error on the internal datasets of each study, the most striking problem is the large gap that remains between these studies and their application in clinical depressive disorder screening. The vast majority of these studies were conducted by researchers in computing, and their experimental specifications, procedures, and objectives do not fully follow the methods of clinical psychiatry. They rarely test their methods on external samples, so their generalizability is questionable. In addition, although the studies in this review that classified major depressive disorder all achieved more than 77% accuracy, the poor sensitivity of some studies limits the effectiveness of the technique as a screening tool because true cases are missed, and low precision may make screening overly burdensome through excessive false positives.
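The interplay between sensitivity and false positives discussed above can be checked against the reported figures. Because F1 is the harmonic mean of precision and recall, precision can be back-calculated whenever recall and F1 are both given. The sketch below does this for the first row of Table 3. It assumes that the reported recall and F1 refer to the same positive (depressed) class and are not averaged across classes, which the source does not state; the function name is our own.

```python
def implied_precision(recall, f1):
    """Solve F1 = 2PR / (P + R) for precision P, given recall R and F1."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 cannot be this large for the given recall")
    return f1 * recall / denominator


# Recall and F1 as listed in Table 3 for Haque et al. 2018 (25).
precision_haque = implied_precision(recall=0.833, f1=0.743)
```

Under the stated assumption, the implied precision is roughly 0.67, so about one in three positive screens would be a false positive on that test set. In a real screening population, where depression is less prevalent than in a research corpus, the proportion of false positives would be expected to be higher.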

Moreover, the datasets commonly utilized in depressive disorder research tend to predominantly feature data from European and American populations, which does not adequately represent the diverse manifestations of depressive disorder across different cultures. There is ample evidence that depressive disorder presents unique features in different cultural contexts (59), which makes it problematic to generalize findings from European and American populations to Asia or other developing regions. Recent developments have seen the emergence of depressive disorder datasets reflecting non-Western cultural contexts, such as the Chinese MODMA dataset. However, there remains a scarcity of studies employing multimodal deep learning approaches that integrate visual, auditory, and textual data within these contexts. Additionally, the absence of a standardized and consistent methodology for data collection across various datasets complicates the comparison of results, posing a significant challenge to the feasibility of deploying these technologies more broadly. This lack of comparability undermines the potential for these advanced diagnostic methods to be validated and applied effectively in diverse clinical settings, thus limiting their utility in global mental health assessments.

In multimodal fusion research, the primary focus is split between early feature fusion and late decision fusion. Studies suggest that decision fusion typically performs better than feature fusion using the same algorithms, as it accounts for the redundancy and complementary information across modalities. The simple splicing of features from different modalities often lacks a solid theoretical foundation. Also, the idea of decision fusion, which relies on averaging or voting mechanisms of different modal results, lacks evidence as well. Nevertheless, network hybrid fusion offers a more sophisticated approach, integrating various modal data more effectively and representing a significant advancement in the field.
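The difference between the two fusion families described above can be illustrated with a deliberately small Python sketch. Everything in it is hypothetical: the logistic unit stands in for a trained network, and the feature values, weights and per-modality probabilities are invented for illustration rather than taken from any included study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_model(features, weights, bias=0.0):
    """Hypothetical stand-in for a trained classifier: one logistic unit."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


def feature_fusion(modal_features, weights, bias=0.0):
    """Early fusion: splice the modality feature vectors, then apply one model."""
    fused = [value for features in modal_features.values() for value in features]
    return logistic_model(fused, weights, bias)


def decision_fusion_average(modal_probs):
    """Late fusion: average the probabilities produced by per-modality models."""
    return sum(modal_probs.values()) / len(modal_probs)


def decision_fusion_vote(modal_probs, threshold=0.5):
    """Late fusion: majority vote over the per-modality binary decisions."""
    votes = sum(1 for p in modal_probs.values() if p >= threshold)
    return votes * 2 > len(modal_probs)


# Illustrative inputs only.
features = {"audio": [0.2, -0.1], "video": [0.5], "text": [1.0, 0.3]}
fused_prob = feature_fusion(features, weights=[0.4, 0.1, -0.2, 0.8, 0.5])

probs = {"audio": 0.62, "video": 0.41, "text": 0.77}
avg_prob = decision_fusion_average(probs)
majority_positive = decision_fusion_vote(probs)
```

In the feature-fusion path a single model sees all modalities at once, so it can in principle learn cross-modal interactions but is sensitive to how the spliced features are scaled. In the decision-fusion path each modality is modelled separately and only the outputs are combined, which is simpler but relies on an averaging or voting rule chosen by the researcher.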

The limitations of this review must be acknowledged. Firstly, the majority of included studies were published only within the last few years, and the total number of studies meeting the inclusion criteria was relatively small. This can be attributed to the fact that the field is still not fully developed; medical research cycles are lengthy and subject to stringent evaluations, and the findings have not yet been sufficiently substantiated to support publication in a large number of journals. Moreover, significant heterogeneity was observed across studies. There are substantial variations in aspects such as deep learning algorithms, selection of modalities, partitioning of training sets, methods of multimodality fusion, feature extraction techniques, and predictive metrics used. Most studies report only a few key statistical indicators, which complicates efforts to compare and analyze data effectively. This diversity in methodologies and reported outcomes poses a challenge for synthesizing results and drawing generalized conclusions from the existing body of research.

In terms of data and modal information, although data diversity and information richness have been shown to improve the accuracy of depression detection in most studies, this conclusion has not been verified by meta-analysis. This study focuses on comparing the optimal combinations of multimodal approaches. Future studies may consider analyzing the differences between unimodal and multimodal approaches to address the question of whether multimodal approaches necessarily overcome the limitations of unimodal approaches. At the same time, we also found that different modalities contribute differently to depression detection. This analysis requires a larger sample size and more data, and current research is insufficient to address this issue. The heterogeneity of the models limits this study to a qualitative description. Although all the studies included in this review are based on deep learning methods for depression detection, the rapid development of deep learning has led to significant differences between models, making it difficult to generalize which technical approach is superior. Nevertheless, this paper analyzed the trends in the development of deep learning technologies for depression detection and found that depression detection technologies based on pre-trained models and fine-tuning with big data are currently emerging as a popular direction.

Based on these insights, it is reasonable to assert that deep learning-based analysis of audiovisual and text information presents a promising avenue for depressive disorder screening. However, further research within the field of psychiatry is necessary to facilitate the integration of this technique into clinical settings. Currently, research in deep learning for depressive disorder screening is predominantly led by the field of information science, which often leads to challenges in translating these findings to practical clinical applications due to interdisciplinary differences. While traditional machine learning methods have been explored in depressive disorder screening, insufficient attention has been given to non-invasive multimodal approaches, which hold significant potential for transforming depressive disorder screening. This underscores the importance of a concerted effort to understand, summarize, and provide an overview of the current state of research in this domain, aiming to advance this methodology for rigorous psychiatric research. There is a strong belief that deep learning-based audiovisual multimodal technology will significantly enhance depressive disorder screening practices in the future.

Conclusions

This review indicated that deep learning-based methods incorporating video, audio, and text modalities might provide an effective means for early depressive disorder screening. Given the challenges associated with current depressive disorder prevention and treatment methods, along with the shortage of psychiatrists and psychological professionals, employing such innovative techniques offers a promising solution. To fully realize the potential of these methods in clinical settings, further and more focused research is essential. There is confidence that these techniques will prove valuable in clinical depressive disorder screening, potentially reducing the disease burden and alleviating the social stress associated with depressive disorder.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the Narrative Review reporting checklist. Available at https://amj.amegroups.com/article/view/10.21037/amj-25-6/rc

Peer Review File: Available at https://amj.amegroups.com/article/view/10.21037/amj-25-6/prf

Funding: This work was supported by the Wuhan Medical Research Project (Healthy Development) (WX23A99) and Wuhan Natural Science Foundation Exploration Plan Municipal Medical Institutions Clinical Research Key Project (2024020801020405, 2025020701020282).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://amj.amegroups.com/article/view/10.21037/amj-25-6/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Friedrich MJ. Depression Is the Leading Cause of Disability Around the World. JAMA 2017;317:1517. [Crossref] [PubMed]
Liu Q, He H, Yang J, et al. Changes in the global burden of depression from 1990 to 2017: Findings from the Global Burden of Disease study. J Psychiatr Res 2020;126:134-40. [Crossref] [PubMed]
World Health Organization. Suicide worldwide in 2019: global health estimates.
Hollon SD, Thase ME, Markowitz JC. Treatment and Prevention of Depression. Psychol Sci Public Interest 2002;3:39-77. [Crossref] [PubMed]
Patel V. Mental health in low- and middle-income countries. Br Med Bull 2007;81-82:81-96. [Crossref] [PubMed]
Kraus C, Kadriu B, Lanzenberger R, et al. Prognosis and improved outcomes in major depression: a review. Transl Psychiatry 2019;9:127. [Crossref] [PubMed]
Dozois DJ, Dobson KS, Ahnberg JL. A psychometric evaluation of the beck depression inventory-II. Psychol Assess 1998;10:83.
Zung WW. A self-rating depression scale. Arch Gen Psychiatry 1965;12:63-70. [Crossref] [PubMed]
Hamilton M. The Hamilton rating scale for depression. In: Assessment of depression. Berlin, Heidelberg: Springer; 1986:143-52.
El-Den S, Chen TF, Gan YL, et al. The psychometric properties of depression screening tools in primary healthcare settings: A systematic review. J Affect Disord 2018;225:503-22. [Crossref] [PubMed]
Fried EI, Flake JK, Robinaugh DJ. Revisiting the theoretical and methodological foundations of depression measurement. Nat Rev Psychol 2022;1:358-68. [Crossref] [PubMed]
Hinshaw SP, Stier A. Stigma as related to mental disorders. Annu Rev Clin Psychol 2008;4:367-93. [Crossref] [PubMed]
Guo Y, Liu Y, Oerlemans A, et al. Deep learning for visual understanding: A review. Neurocomputing 2016;187:27-48.
Liu ZT, Xiang CN, Zhong BL, et al. A review of speech-based depression detection research. J Signal Process 2023;39:616-31.
He L, Niu M, Tiwari P, et al. Deep learning for depression recognition with audiovisual cues: A review. Inf Fusion 2022;80:56-86.
Ding Z, Zhou Y, Dai AJ, et al. Speech based suicide risk recognition for crisis intervention hotlines using explainable multi-task learning. J Affect Disord 2025;370:392-400. [Crossref] [PubMed]
Hosseinifard B, Moradi MH, Rostami R. Classifying depression patients and normal subjects using machine learning techniques and nonlinear features from EEG signal. Comput Methods Programs Biomed 2013;109:339-45. [Crossref] [PubMed]
Yasin S, Hussain SA, Aslan S, et al. EEG based Major Depressive disorder and Bipolar disorder detection using Neural Networks: A review. Comput Methods Programs Biomed 2021;202:106007. [Crossref] [PubMed]
Islam MR, Kabir MA, Ahmed A, et al. Depression detection from social network data using machine learning techniques. Health Inf Sci Syst 2018;6:8. [Crossref] [PubMed]
Guntuku SC, Yaden DB, Kern ML, et al. Detecting depression and mental illness on social media: an integrative review. Curr Opin Behav Sci 2017;18:43-9.
Ding Z, Chen J, Zhong BL, et al. Emotional stimulated speech-based assisted early diagnosis of depressive disorders using personality-enhanced deep learning. J Affect Disord 2025;376:177-88. [Crossref] [PubMed]
Dong J, Wei W, Wu K, et al. The application of machine learning in depression. Adv Psychol Sci 2020;28:266.
Page MJ, McKenzie JE, Bossuyt PM, et al. Updating guidance for reporting systematic reviews: development of the PRISMA 2020 statement. J Clin Epidemiol 2021;134:103-12. [Crossref] [PubMed]
Yang L, Jiang D, Xia X, et al. Multimodal measurement of depression using deep learning models. In Mountain View, CA, United States; 2017:53-9. (AVEC 2017 - Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, co-located with MM 2017). Available online: http://dx.doi.org/10.1145/3133944.3133948
Haque A, Guo M, Miner AS, et al. Measuring Depression Symptom Severity from Spoken Language and 3D Facial Expressions. arXiv; 2018 [cited 2024 Mar 21]. Available online: http://arxiv.org/abs/1811.08592
Al Hanai T, Ghassemi M, Glass J. Detecting Depression with Audio/Text Sequence Modeling of Interviews. In: Interspeech 2018. ISCA; 2018:1716-20. Available online: https://www.isca-archive.org/interspeech_2018/alhanai18_interspeech.html
Yang L, Jiang D, Sahli H. Integrating deep and shallow models for multi-modal depression analysis—hybrid architectures. IEEE Trans Affect Comput 2018;12:239-53.
Qureshi SA, Saha S, Hasanuzzaman M, et al. Multitask Representation Learning for Multimodal Estimation of Depression Level. IEEE Intell Syst 2019;34:45-52.
Rohanian M, Hough J, Purver M. Detecting depression with word-level multimodal fusion. In Graz, Austria; 2019:1443-7. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; vols. 2019-September). Available online: http://dx.doi.org/10.21437/Interspeech.2019-2283
Lam G, Dongyan H, Lin W. Context-aware Deep Learning for Multi-modal Depression Detection. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, United Kingdom: IEEE; 2019:3946-50. Available online: https://ieeexplore.ieee.org/document/8683027/
Ray A, Kumar S, Reddy R, et al. Multi-level Attention Network using Text, Audio and Video for Depression Prediction. In: Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop. Nice France: ACM; 2019:81-8. Available online: https://dl.acm.org/doi/10.1145/3347320.3357697
Makiuchi MR, Warnita T, Uto K, et al. Speech-linguistic multimodal representation for depression severity assessment. J Inf Process 2019;2019:1-4.
Niu M, Tao J, Liu B, et al. Multimodal Spatiotemporal Representation for Automatic Depression Level Detection. IEEE Trans Affect Comput 2020;14:294-307.
Lin L, Chen X, Shen Y, et al. Towards automatic depression detection: A BiLSTM/1D CNN-based model. Applied Sciences 2020;10:8701.
Zhang Z, Lin W, Liu M, et al. Multimodal Deep Learning Framework for Mental Disorder Recognition. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). Buenos Aires, Argentina: IEEE; 2020:344-50. Available online: https://ieeexplore.ieee.org/document/9320154/
Xiao J, Huang Y, Zhang G, et al. A Deep Learning Method on Audio and Text Sequences for Automatic Depression Detection. In: 2021 3rd International Conference on Applied Machine Learning (ICAML). Changsha, China: IEEE; 2021:388-92. Available online: https://ieeexplore.ieee.org/document/9712128/
Solieman H, Pustozerov EA. The Detection of Depression Using Multimodal Models Based on Text and Voice Quality Features. In: 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus). St. Petersburg, Moscow, Russia: IEEE; 2021:1843-8. Available online: https://ieeexplore.ieee.org/document/9396540/
Qureshi SA, Dias G, Saha S, et al. Gender-Aware Estimation of Depression Severity Level in a Multimodal Setting. In: 2021 International Joint Conference on Neural Networks (IJCNN). Shenzhen, China: IEEE; 2021:1-8. Available online: https://ieeexplore.ieee.org/document/9534330/
Sun H, Liu J, Chai S, et al. Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level. Sensors (Basel) 2021;21:4764. [Crossref] [PubMed]
Sun G, Zhao S, Zou B, et al. Multimodal depression detection using a deep feature fusion network. In: Lu Y, Cheng C. editors. Third International Conference on Computer Science and Communication Technology (ICCSCT 2022). Beijing, China: SPIE; 2022:269. Available online: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/12506/2662620/Multimodal-depression-detection-using-a-deep-feature-fusion-network/10.1117/12.2662620.full
Guo Y, Zhu C, Hao S, et al. Automatic Depression Detection via Learning and Fusing Features From Visual Cues. IEEE Trans Comput Soc Syst 2022;1-8.
Chen Y, Liu C, Du Y, et al. Machine learning classification model using Weibo users’ social appearance anxiety. Personal Individ Differ 2022;188:111449.
Rasipuram S, Bhat JH, Maitra A, et al. Multimodal Depression Detection Using Task-oriented Transformer-based Embedding. In: 2022 IEEE Symposium on Computers and Communications (ISCC). Rhodes, Greece: IEEE; 2022:1-4. Available online: https://ieeexplore.ieee.org/document/9913044/
Fang M, Peng S, Liang Y, et al. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed Signal Process Control 2023;82:104561.
Saggu GS, Gupta K, Arya K, et al. DepressNet: A multimodal hierarchical attention mechanism approach for depression detection. Int J Eng Sci 2022;15:24-32.
Tasnim M, Ehghaghi M, Diep B, et al. Depac: a corpus for depression and anxiety detection from speech. ArXiv Prepr ArXiv230612443. 2023.
Aloshban N, Esposito A, Vinciarelli A. What You Say or How You Say It? Depression Detection Through Joint Modeling of Linguistic and Acoustic Aspects of Speech. Cogn Comput 2022;14:1585-98.
Uddin MZ. Depression detection in text using long short-term memory-based neural structured learning. In Chittagong, Bangladesh; 2022:408-14 (2022 International Conference on Innovations in Science, Engineering and Technology, ICISET 2022). Available online: http://dx.doi.org/10.1109/ICISET54810.2022.9775893
Huang X. Ideal construction of chatbot based on intelligent depression detection techniques. In Changchun, China; 2022:511-5. (2022 IEEE International Conference on Electrical Engineering, Big Data and Algorithms, EEBDA 2022). Available online: http://dx.doi.org/10.1109/EEBDA53927.2022.9744938
Gratch J, Artstein R, Lucas GM, et al. The distress analysis interview corpus of human and computer interviews. In: LREC. Reykjavik; 2014:3123-8.
DeVault D, Artstein R, Benn G, et al. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. In: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems; 2014:1061-8.
Valstar M, Schuller B, Smith K, et al. AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge. Barcelona Spain: ACM; 2013:3-10. Available online: https://dl.acm.org/doi/10.1145/2512530.2512533
Valstar M, Schuller B, Smith K, et al. AVEC 2014: 3D Dimensional Affect and Depression Recognition Challenge. In: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. Orlando Florida USA: ACM; 2014:3-10. Available online: https://dl.acm.org/doi/10.1145/2661806.2661807
Wu P, Wang R, Lin H, et al. Automatic depression recognition by intelligent speech signal processing: A systematic survey. CAAI Trans Intell Technol 2022. Available online: https://doi.org/10.1049/cit2.12113
Sarker IH. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions. SN Comput Sci 2021;2:420. [Crossref] [PubMed]
Tang J, Alelyani S, Liu H. Feature selection for classification: A review. Data Classif Algorithms Appl 2014;37.
Cummins N, Sethu V, Epps J, et al. Analysis of acoustic space variability in speech affected by depression. Speech Commun 2015;75:27-49.
Sun JW, Fan R, Wang Q, et al. Identify abnormal functional connectivity of resting state networks in Autism spectrum disorder and apply to machine learning-based classification. Brain Res 2021;1757:147299. [Crossref] [PubMed]
Kessler RC, Bromet EJ. The epidemiology of depression across cultures. Annu Rev Public Health 2013;34:119-38. [Crossref] [PubMed]
Gaebel W, Wölwer W. Facial expression and emotional face recognition in schizophrenia and depression. Eur Arch Psychiatry Clin Neurosci 1992;242:46-52. [Crossref] [PubMed]
Zhang S, Zhang X, Zhao X, et al. MTDAN: A lightweight multi-scale temporal difference attention networks for automated video depression detection. IEEE Trans Affect Comput 2023;15:1078-89.
Goldman LS, Nielsen NH, Champion HC. Awareness, diagnosis, and treatment of depression. J Gen Intern Med 1999;14:569-80. [Crossref] [PubMed]
Poria S, Cambria E, Howard N, et al. Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Neurocomputing 2016;174:50-9.
Ross AA, Govindarajan R. Feature level fusion of hand and face biometrics. In: Biometric technology for human identification II. SPIE; 2005:196-204.
Zeng Z, Pantic M, Roisman GI, et al. A survey of affect recognition methods: Audio, visual and spontaneous expressions. In: Proceedings of the 9th international Conference on Multimodal Interfaces; 2007:126-33.
Baldessarini RJ, Finklestein S, Arana GW. The predictive power of diagnostic tests and the effect of prevalence of illness. Arch Gen Psychiatry 1983;40:569-73. [Crossref] [PubMed]
Baratloo A, Hosseini M, Negida A, et al. Part 1: Simple Definition and Calculation of Accuracy, Sensitivity and Specificity. Emerg (Tehran) 2015;3:48-9.
Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 2020;21:6. [Crossref] [PubMed]
Willmott CJ, Matsuura K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim Res 2005;30:79-82.
McGorry PD. Early intervention in psychosis: obvious, effective, overdue. J Nerv Ment Dis 2015;203:310-8. [Crossref] [PubMed]
Hsiao JK, Bartko JJ, Potter WZ. Diagnosing diagnoses: Receiver operating characteristic methods and psychiatry. Arch Gen Psychiatry 1989;46:664-7. [Crossref] [PubMed]
Bleeker SE, Moll HA, Steyerberg EW, et al. External validation is necessary in prediction research: a clinical example. J Clin Epidemiol 2003;56:826-32. [Crossref] [PubMed]
Stein DJ, Shoptaw SJ, Vigo DV, et al. Psychiatric diagnosis and treatment in the 21st century: paradigm shifts versus incremental integration. World Psychiatry 2022;21:393-414. [Crossref] [PubMed]
Cabitza F, Campagner A, Soares F, et al. The importance of being external. methodological insights for the external validation of machine learning models in medicine. Comput Methods Programs Biomed 2021;208:106288. [Crossref] [PubMed]

doi: 10.21037/amj-25-6
Cite this article as: Ding Z, Xu YM, Liu CL, Dai AJ, Liu ZT, Zhong BL. Deep learning-based screening for depressive disorders using audio-visual text multimodal information: a narrative review. AME Med J 2026;11:18.
