Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (2024)

Renjie Liang
National University Singapore
liangrj5@gmail.com Yiming Yang
National University Singapore
e0920761@u.nus.edu Hui Lu
Nanyang Technological University
hui007@ntu.edu.sg Li Li
National University Singapore
lili02@u.nus.edu

Abstract

Temporal Sentence Grounding in Videos (TSGV) aims to retrieve the event timestamps described by the natural language query from untrimmed videos. This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches exquisitely design complex architectures to improve accuracy, suffering from inefficiency and heaviness. Previous attempts to address this issue have primarily concentrated on feature fusion layers. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from multiple networks.Specifically, We first unify different outputs of the different models. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers. Additionally, we propose a Shared Encoder strategy to enhance the learning of shallow layers. Extensive experimental results on three popular TSGV benchmarks demonstrate that our method is both effective and efficient. Our code is available at https://github.com/renjie-liang/EMET.

1 Introduction

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (1)

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (2)

Temporal Sentence Grounding in Videos (TSGV), which aims to ground a temporal segment in an untrimmed video with a natural language query, has drawn widespread attention over the past few years [24]. There is a clear trend that top-performing models are becoming larger with numerous parameters.Additionally, the recent work shows that accuracy in TSGV tasks has reached a bottleneck period, while the combination of complex networks and multiple structures is becoming more prevalent to further improve the ability of the model,which will cause an expansion of the model size. However, the heavy resource cost required by the approaches restricts their applications.

In order to improve efficiency, FMVR [5] and CCA [16] are proposed to construct fast TSGV models by reducing the fusion time.However, they only decline the inferred time significantly, the whole network is still time-consuming, as depicted in Figure 1(a).Another shortage is that FMVR and CCA are required to store the feature after the encoder with extra storage.

To tackle this challenge, we extend the accelerated span to the entire network, shown in Figure 1(a).A natural approach is to reduce the complexity of the network, which can involve decreasing the hidden dimension, reducing the number of layers, and eliminating auxiliary losses. Nevertheless, all of these methods will lead to a decrease in performance to some extent.One promising technique is knowledge distillation [7] to mitigate the decrease in performance and maintain high levels of accuracy when lighting the network.

Initially, knowledge distillation employed a single teacher, but as technology advanced, multiple teachers have been deemed beneficial for imparting greater knowledge[4], as extensively corroborated in other domains[15].Multi-teacher strategy implies that there is a more diverse range of dark knowledge to be learned, with the optimal knowledge being more likely to be present [18].Thus far, multiple-teacher knowledge distillation has not been studied and exploited for the TSGV task.

An immediate problem is that different models will produce heterogeneous output, e.g., candidates for proposed methods, or probability distribution for proposal-free methods. A question is how to identify optimal knowledge from multiple teachers.In addition, knowledge was hardly backpropagted to the front layers from the soft label in the last layers [13], meaning that the front part of the student model usually hardly enjoys the benefit of teachers’ knowledge.Until now, here are three issues we need to deal with: i) how to unify knowledge from the different models, ii) how to select the optimal knowledge and assign weights among these teachers, and iii) how the front layers of the student benefit from the teachers.

Several challenges arise: Firstly, different models may yield heterogeneous outputs such as varying candidate spans or probability distributions.Secondly, backpropagating knowledge is difficult from the last to the front layers, limiting the student shallow layer to learn from teacher models [13].

For the first challenge, we unify the heterogeneity of model outputs by converting them into a unified 1D probability distribution. This enables us to seamlessly integrate the knowledge during the model training.The 1D distribution, derived from the span-based method in the proposal-free catalog, offers a speed advantage over proposal-based methods.

To integrate knowledge from various models, we have developed a Knowledge Aggregation Unit (KAU). This unit utilizes multi-scale information [9] to derive a higher-quality target distribution, moving beyond mere averaging of probabilities. The KAU adaptively assigns weights to different teacher models. This approach overcomes the limitations of manually tuning weights, a sensitive hyperparameter in multi-teacher distillation [10].

Regarding the second challenge, we implemented a shared layer strategy to facilitate the transfer of shallow knowledge from the teacher to the student model. This involves co-training a teacher model with our student model, sharing encoder layers and aligning hidden states. Such an arrangement ensures comprehensive and global knowledge acquisition by the student model.

During inference, we only exploit the student model to perform inference, which does not add computational overhead.To sum up, this paper’s primary contributions can be distilled into three main points, which are outlined below:

•
We propose a multi-teacher knowledge distillation framework for the TSGV task. This approach substantially reduces the time consumed and significantly decreases the number of parameters, while still maintaining high levels of accuracy.
•
To enable the whole student to benefit from various teacher models, we unify the knowledge from different models and use the KAU module to adaptively integrate to a single soft label. Additionally, a shared encoder strategy is utilized to share knowledge from the teacher model in front layers.
•
Extensive experimental results on three popular TSGV benchmarks demonstrate that our proposed method performs superior to the state-of-the-art methods and has the highest speed and minimal parameters and computation.

2 Related Work

Given an untrimmed video, temporal sentence grounding in videos (TSGV) is to retrieve a video segment according to a query, which is also known as Video Moment Retrieval (VMR).Existing solutions to video grounding are roughly categorized into proposal-based and proposal-free frameworks. We also introduce some works on fast video temporal grounding as follows.

2.1 Proposal-based Methods

The majority of proposal-based approaches rely on some carefully thought-out dense sample strategies, which gather a set of video segments as candidate proposals and rank them in accordance with the scores obtained for the similarity between the proposals and the query to choose the most compatible pairs.Zhang etal. [25] convert visual features into a 2D temporal map and encode the query in sentence-level representation, which is the first solution to model proposals with a 2D temporal map (2D-TAN).BAN-APR [3] utilize a boundary-aware feature enhancement module to enhance the proposal feature with its boundary information by imposing a new temporal difference loss. Currently, most proposal-based methods are time-consuming due to the large number of proposal-query interactions.

2.2 Proposal-free Methods

Actually, the caliber of the sampled proposals has a significant impact on the impressive performance obtained by proposal-based methods. To avoid incurring the additional computational costs associated with the production of proposal features, proposal-free approaches directly regress or forecast the beginning and end times of the target moment.VSLNet Zhang etal. [22] exploits context-query attention modified from QANet Yu etal. [19] to perform fine-grained multimodal interaction. Then a conditioned span predictor computes the probabilities of the start/end boundaries of the target moment.SeqPAN [23] design a self-guided parallel attention module to effectively capture self-modal contexts and cross-modal attentive information between video and text inspired by sequence labeling tasks in natural language processing.Yang and Wu [17] propose Entity-Aware and Motion-Aware Transformers (EAMAT) that progressively localize actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries in a shrunken temporal region with motion queries.Nevertheless, with the improvement of performance, huge and complex architectures inevitably result in higher computational cost during inference phase.

2.3 Fast Video Temporal Grounding

Recently, fast video temporal grounding has been proposed for more practical applications.According to [5], the standard TSGV pipeline can be divided into three components. The visual encoder and the text encoder are proved to have little influence in model testing due to the features pre-extracted and stored at the beginning of the test, and cross-modal interaction is the key to reducing the test time. Thus, a fine-grained semantic distillation framework is utilized to leverage semantic information for improving performance. Besides, Wu etal. [16] utilize commonsense knowledge to obtain bridged visual and text representations, promoting each other in common space learning.However, based on our previous analysis, the inferred time proposed by [5] is only a part of the entire prediction processing. The processing from inputting video features to predicting timestamps is still time-consuming.

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (3)

3 Methodology

In this section, we first give a brief task definition of TSGV in Section 3.1. In the following, heterogeneous knowledge unification is presented as a prerequisite in Section 3.2.1. Then we introduce the student network (Section 3.2.2), shared encoder strategy (Section 3.2.5) and knowledge aggregation unit (Section 3.2.4) as shown in 2.

Finally, the training and inference processes are presented in section 3.3, as well as the loss settings.

3.1 Problem Formulation

Given an untrimmed video $V=[f_{t}]_{t=1}^{T}$ and the language query $Q=[q_{j}]_{j=1}^{m}$ , where $T$ and $m$ are the numbers of frames and words, respectively. The start and end times of the ground truth moment are indicated by $\tau^{s}$ and $\tau^{e}$ , $1\leq\tau_{s}<\tau_{e}\leq T$ .Mathematically, TSGV is to retrieve the target moment starting from $\tau_{s}$ and ending at $\tau_{e}$ by giving a video $V$ and query $Q$ , i.e., $\mathcal{F}_{TSGV}:(V,Q)\mapsto(\tau_{s},\tau_{e})$ .

3.2 General Scheme

3.2.1 Heterogeneous Knowledge Unification

Compared to the proposal-based method, the span-based method doesn’t need to generate redundant proposals, which is an inherent advantage in terms of efficiency. Meanwhile, 1D distribution carries more knowledge than the regression-based method.Hence we unify various heterogeneous outputs into 1D probability distribution and develop our network based on the span-based method, as shown in Figure 2. The outputs of the span-based method are the 1D probability distributions of start and end moments, denoted as $P_{s},P_{e}\in\mathbb{R}^{n}$ . To keep concise, we adopt $P\in\mathbb{R}^{2n}$ without subscripts to express stacked probability for the start and end moments.

We simply adopt the softmax function to the outputs of the span-based methods and obtain probability distributions.

\displaystyle P_{s}=Softmax(P^{\prime}_{s})\quad P_{e}=Softmax(P^{\prime}_{e})

(1)

2D-map anchor-based method is a common branch of the proposal-base method, such as [25], [3]. A 2D map $S=[s_{i,j}]\in\mathbb{R}^{n\times n}$ is generatedto model temporal relations between proposal candidates, on which one dimension indicates the start moment and the other indicates the end moment. We calculate the max scores of $S$ by row/column as start/end distributions.

	$\displaystyle P_{s}=$	$\displaystyle Softmax(\max\limits_{j}s_{i,j})$		(2)
	$\displaystyle P_{e}=$	$\displaystyle Softmax(\max\limits_{i}s_{i,j})$		(2)

As for the regression-based method, we can get a time pair $(t_{s},t_{e})$ after computation. Then the Gaussian distribution is leveraged to simulate the probability distribution of the start/end moments as follows:

	$\displaystyle P_{s}=$	$\displaystyle Softmax({N(t_{s},\sigma^{2})})$		(3)
	$\displaystyle P_{e}=$	$\displaystyle Softmax({N(t_{e},\sigma^{2})})$		(3)

The proposal-generated method will generate a triple candidate list $S^{\prime}=(t_{s}^{i},t_{e}^{i},r^{i})\in\mathbb{R}^{3\times k}$ , where $k$ is the number of proposal candidates. Similarly, we use the Gaussian distribution to generate the probability distribution of the start/end moment for each candidate. Then we put different weights on various candidates and accumulate them:

	$\displaystyle P_{s}=$	$\displaystyle Softmax(\sum_{i}r^{i}{N(t_{s}^{i},\sigma^{2})})$		(4)
	$\displaystyle P_{e}=$	$\displaystyle Softmax(\sum_{i}r^{i}{N(t_{e}^{i},\sigma^{2})})$		(4)

where $\sigma^{2}$ is the variance of Gaussian distribution $N$ .

3.2.2 Student Network

The design of the student network emphasizes efficient processing. The video feature, represented as $\bm{V}\in\mathbb{R}^{n\times d_{v}}$ , utilizes the I3D framework, as proposed by Carreira et al. [2] and further refined by Zhang [23].In parallel, the query feature, denoted by $\bm{Q}\in\mathbb{R}^{m\times d_{q}}$ , is initialized using GloVe embeddings, with $n$ indicating the length of the features.Both the video and query features undergo projection and encoding processes, which align their dimensions to facilitate uniformity and interoperability in subsequent computational stages.

	$\displaystyle\bm{V^{\prime}}$	$\displaystyle=\mathtt{VisualEncoder}(\bm{V})$		(5)
	$\displaystyle\bm{Q^{\prime}}$	$\displaystyle=\mathtt{QueryEncoder}(\bm{Q})$		(5)

Subsequently, a lightweight transformer is employed to fuse the video and query features. This approach, in contrast to the direct dot multiplication method referenced in Gao et al. [5], significantly enhances performance without imposing excessive computational load on the model.

\bm{V}^{qv}=\mathtt{Transformer}(\bm{V^{\prime}},\bm{Q^{\prime}})

(6)

Finally, a predictor is utilized to generate the logits corresponding to the start and end points. Then multiply $\bm{P_{s}}$ and $\bm{P_{e}}$ to form a matrix, within which the highest value is identified. The row and column indices of this peak value correspond to the predicted start and end indices, respectively.

\displaystyle(\bm{P_{s}},\bm{P_{e}})=\mathtt{Predictor}(\bm{V^{qv}})\in\mathbb%{R}^{n}

(7)

3.2.3 Teacher Network

The teacher network is architected with a focus on performance optimization, in contrast to the student network which is streamlined for efficiency. The specific differences between the two networks are comprehensively itemized in Table 1. Cross-modal fusion layers have been incorporated to augment the interactivity among disparate modalities within the teacher model.Following the Encoder stage, a group of transformations is employed to facilitate the amalgamation of query and video features. Subsequently, these cross-modal features are fed into the fusion layer.

	$\displaystyle\bm{V^{\prime}}$	$\displaystyle=\mathtt{Transformer}(\bm{V^{\prime}},\bm{Q^{\prime}})$		(8)
	$\displaystyle\bm{Q^{\prime}}$	$\displaystyle=\mathtt{Transformer}(\bm{Q^{\prime}},\bm{V^{\prime}})$		(8)

ModuleLayerTeacherStudentProjectorCONV1D_query11CONV1D_video44Fusion LayerTransformer_query40Transformer_video40Transformer_fusion11PredictorCONV1D_start63CONV1D_end63

MethodYearCharades-STAActivityNetTACoSFLOPS (B)Params (M)Times (ms)sumACCFLOPS (B)Params (M)Times (ms)sumACCFLOPS (B)Params (M)Times (ms)sumACCSCDM201916.500012.8800-87.87260.230015.6500-56.61260.230015.6500--2D-TAN202052.261669.060613.342566.051067.900082.440077.990371.431067.900082.440077.9903*36.82VSLNet20200.03000.78288.002077.500.05210.80058.989369.380.06300.80058.989344.30SeqPAN20210.02091.186310.5168102.200.02141.214313.713873.870.02181.235923.302567.71EMB20220.08852.216822.390097.580.20336.151525.087170.880.28172.217223.634960.36EAMAT20221.288194.121556.1753103.654.154593.0637125.782260.944.154593.0637125.782264.98BAN-APR20229.452734.649119.9767105.9625.468845.671444.858777.7925.468845.671444.8587*52.10CPL20223.44445.375726.845171.633.89297.011526.442349.14----CNM20220.52605.37115.448250.100.50637.00744.8629*48.96----FVMR2021---88.75---71.85----CCA2022137.298479.767126.973489.41151.102322.570931.5400*75.95151.102322.570931.540050.90EMTM (Ours)0.00810.65694.799892.800.00840.68483.543170.910.00870.70654.573758.24

3.2.4 Knowledge Aggregation Unit

Our goal is to combine all the unified predictions from $b$ branches to establish a strong teacher distribution.Drawing inspiration from [9], we developed the Knowledge Aggregation Unit (KAU). The KAU integrates parallel transformations with varying receptive fields, harnessing both local and global contexts to derive a more accurate target probability. The KAU’s design is illustrated in Figure 3.

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (4)

Considering saving more original information, we first take the video features $V^{\prime}$ in the eq. (5) as input and then add convolution layers. The convolution operation is conducted with a small kernel size of 3 initially and then consistently increased to 5 and 7.Further, we incorporate the average pooling of query features $Q^{\prime}$ in the eq. (5) for richer representations. Then we concatenate all the splits and obtain the intermediate vector $v$ , denoted as:

\displaystyle v=[q_{avg},g([v_{conv3},v_{conv5},v_{conv7}])]

(9)

where $q_{avg}$ denotes the result of $Q^{\prime}$ after average pooling, g(·) denotes the global pooling function, $v_{conv3}$ , $v_{conv5}$ , and $v_{conv7}$ denote the results after the convolution layers with kernel size 3, 5, and 7, respectively.

Passing through a fully connected layer $FC$ , a channel-wise softmax operator is applied to obtain the soft attention $a$ .

\displaystyle a=Softmax(\mathtt{FC}(v))\in\mathbb{R}^{2b\times n}

(10)

where the $b$ denote the number of teacher branch, $2b$ is because there are two probability distributions (i.e., start and end).

Finally, we fuse prediction results from multiple branches via an element-wise summation to obtain the weighted ensemble probability.

\displaystyle\widetilde{P}=\sum_{i=1}^{b}{a^{i}\otimes\hat{P}^{i}}\in\mathbb{R%}^{2\times n}

(11)

where $\widetilde{P}$ denotes the ensemble probability, $P^{i}\in\mathbb{R}^{2\times n}$ means the start and end distribution from $i$ -th teacher branch, and $\otimes$ refers to the channel-wise multiplication. Our experiments (see Section 4.5.1) prove that the weights generated by KAU can achieve better distillation performance.

3.2.5 Shared Encoder Strategy

The backpropagation of knowledge from soft labels often provides limited benefit to the shallow layers, primarily due to the influence of non-linear activation functions and dropout mechanisms. However, the concept of feature invariance in these layers, in Zeiler et al.[21], guides our approach. We propose the sharing of several shallow layers between the student and teacher networks. This collaborative training strategy enables the shallow layers in the student network to assimilate additional knowledge from the teacher network, enhancing their learning capacity.

Specifically, a student and a teacher share their text and query encoder, shown in Figure 2. The encoder consists of several conv1D in our network, which is lightweight and fast due to its inherent characteristics.The $\mathtt{VisualEncoder}$ , $\mathtt{QueryEncoder}$ in eq. (5) denotethe shared layers in our network.

3.3 Training and Inference

3.3.1 TSGV Loss

The overall training loss of our model is described as follows. For the student and the teacher, the hard loss (i.e. label loss) is used to optimize distributions of start/end boundaries.

	$\displaystyle L^{st}_{loc}=f_{CE}(P^{st},Y)$		(12)
	$\displaystyle L^{tc}_{loc}=f_{CE}(P^{tc},Y)$		(12)

where $f_{CE}$ is the cross-entropy function, and $Y$ is one-hot labels for the start and end boundaries of ground truth. Similarly, we encourage ensemble probability to get closer to ground truth distribution.

\displaystyle L^{ens}_{loc}=f_{CE}(\widetilde{P},Y)

(13)

As we discussed previously, the learned ensemble information serves as complementary cues to provide an enhanced supervisory signal to our student model. As a result, we introduce multiple distillation learning, which transfers the rich knowledge in the form of softened labels. The formulation is given by:

\displaystyle L_{dis}=f_{KL}(softmax(P^{st},t),softmax(\widetilde{P},t)

(14)

where $f_{KL}$ represents the KL divergence. The $t$ is the temperature in knowledge distillation, which control the smoothness of the output distribution.

Based on the above design, the overall objective for a training video-query pair is formulated as:

\displaystyle L=L^{st}_{loc}+L^{tc}_{loc}+L^{ens}_{loc}+\alpha L_{dis}

(15)

where $\alpha$ is a balance term.

3.3.2 Inference

The teacher and student models will be collaboratively trained, while we only adopt the student model for TSGV during testing. The learned rich information serves as complementary cues to provide an enhanced supervisory signal to the TSGV model.Compared with FMVR [5] and CCA [16], we won’t pre-calculate and store visual features.

MethodCharades-STAActivityNetTACoSR1@0.3R1@0.5R1@0.7mIoUR1@0.3R1@0.5R1@0.7mIoUR1@0.3R1@0.5R1@0.7mIoUSCDM-54.4433.43-54.8036.7519.86-26.1121.17--2D-TAN-42.8023.25-58.7544.0527.38-35.1725.1711.6524.16VSLNet64.3047.3130.1945.1563.1643.2226.1643.1929.6124.2720.0324.11SeqPAN73.8460.8641.3453.9261.6545.5028.3745.1148.6439.6428.0737.17EMB72.5058.3339.2553.0964.1344.8126.0745.5950.4637.8222.5435.49EAMAT74.1961.6941.9654.4555.3338.0722.8740.1250.1138.1626.8236.43BAN-APR*74.0563.6842.28*54.15*65.1148.1229.67*45.8748.2433.74*17.44*32.95CPL66.4049.2422.3943.4855.7331.3712.3236.82----CNM60.0435.1514.95-55.6833.33*12.81* 36.15----FVMR-55.0133.74-60.6345.0026.85-41.4829.12--CCA70.4654.1935.2250.0261.9946.5829.37*45.1145.3032.8318.07-EMTM (Ours)72.7057.9139.8053.0063.2044.7326.0845.3345.7834.8323.4134.44 $\Delta_{SOTA}$ $\uparrow$ 2.24 $\uparrow$ 2.90 $\uparrow$ 4.58 $\uparrow$ 2.98 $\uparrow$ 1.21 $\downarrow$ 1.85 $\downarrow$ 3.29 $\uparrow$ 0.22 $\uparrow$ 0.48 $\uparrow$ 2.42 $\uparrow$ 5.34-

4 Experiments

4.1 Datasets

To evaluate the performance of TSGV, we conduct experiments on three challenging datasets, all the queries in these datasets are in English. Details of these datasets are shown as follows:

Charades-STA[6] is composed of daily indoor activities videos, which is based on Charades dataset[14]. This dataset contains 6672 videos, 16,128 annotations, and 11,767 moments. The average length of each video is 30 seconds. $12,408$ and $3,720$ moment annotations are labeled for training and testing, respectively;

ActivityNet Caption[1] is originally constructed for dense video captioning, which contains about $20$ k YouTube videos with an average length of 120 seconds. As a dual task of dense video captioning, TSGV utilizes the sentence description as a query and outputs the temporal boundary of each sentence description.

TACoS[12] is collected from MPII Cooking dataset[12], which has 127 videos with an average length of $286.59$ seconds.

4.2 Evaluation Metrics

Following existing video grounding works, we evaluate the performance on two main metrics:

mIoU:“mIoU" is the average predicted Intersection over Union in all testing samples. The mIoU metric is particularly challenging for short video moments;

Recall:We adopt “ $\textrm{R@}n,\textrm{IoU}=\mu$ ” as the evaluation metrics, following[6]. The “ $\textrm{R@}n,\textrm{IoU}=\mu$ ” represents the percentage of language queries having at least one result whose IoU between top- $n$ predictions with ground truth is larger than $\mu$ . In our experiments, we reported the results of $n=1$ and $\mu\in\{0.3,0.5,0.7\}$ .

The Metric of Efficiency:Time, FLOPs, and Params are used to measure the efficiency of the model.Specifically, the time refers to the entire inferring time from the input of the video and query pair to the output of the prediction. FLOPs refers to floating point operations, which is used to measure the complexity of the model. Params refers to the model parameter size except the word embedding.

4.3 Implementation Details

For language query $Q$ , we use the $300$ -D GloVe[11] vectors to initialize each lowercase word, which are fixed during training. Following the previous methods, 3D convolutional features (I3D) are extracted to encode videos.We set the dimension of all the hidden layers as $128$ , the kernel size of the convolutional layer as $7$ , and the head size of multi-head attention as $8$ in our model.For all datasets, models are trained for $100$ epochs. The batch size is set to $16$ . The dropout rate is set as 0.2.Besides, an early stopping strategy is adopted to prevent overfitting.The whole framework is trained by Adam optimizer with an initial learning rate of 0.0001. The loss weight $\alpha$ is set as 0.1 in all the datasets. The temperate was set to 1, 3, 3 on Charades-STA, ActivityNet, and TACoS.The pre-trained teacher models are selected in SeqPAN, BAN-APR, EAMAT, and CCA.More ablation studies can be found in Section4.5. All experiments are conducted on an NVIDIA RTX A5000 GPU with 24GB memory. All experiments were performed three times, and reporting the average of performance.

4.4 Comparison with State-of-the-art Methods

We strive to gather the most current approaches, and compare our proposed model with the following state-of-the-art baselines on three benchmark datasets:

•
Proposal-based Methods: SCDM [20], 2D-TAN [25], BAN-APR [3].
•
Proposal-free Methods: VSLNet [22], SeqPAN [23], EMB [8], EAMAT [17].
•
Weakly Supervised Methods: CPL [26], CNM [26]
•
Fast Methods: FVMR [5], CCA [16]

The best performance is highlighted in bold and the second-best is highlighted with underline in tables.

Overall Efficiency-Accuracy Analysis

Considering that fast TSGV task pays the same attention to efficiency as accuracy, we evaluate FLOPs, Params, and Times for each model. For a fair comparison, the batch size is set to 1 for all methods during inference. Besides, we also calculate the sum of the accuracy in terms of “R1@0.3” and “R1@0.5”, named sumACC to evaluate the whole performance of each model.

As Table 2 shows, our method surpasses all other methods and achieves the highest speed, minimal FLOPs and Params on all three datasets.We note that EMTM is at least 2000 times fewer in FLOPs than state-of-the-art proposal-based models (SCDM and 2D-TAN). According to sumACC, EMTM outperforms these two models by gains of at most 26.75% on Charades-STA and 14.30% on ActivtyNet.Despite the parameter size of VSLNet is at the same level as our method, we outperform it significantly in terms of accuracy, achieveing 15.30% absolute improvement by “sumACC” on Charades-STA.When it comes to CCA, which is proposed for fast TSGV, EMTM outperforms 16950x fewer in FLOPs and 121x fewer in model parameter size on Charades-STA.The above comparison illustrates that our method has significant efficiency and accuracy advantages.

Accuracy Analysis

As shown in Table 3, we can observe that our method performs better than extensive methods in most metrics on three benchmark datasets. Compared with FVMR and CCA, our model performers better in all metrics. Especially, EMTM achieves an absolute improvement of 4.58% on Charades- STA and 5.34% on TACoS on the metric "R@1, IoU=0.7", which is a more crucial criterion with higher quality.

The performance of our model on ActivityNet is slightly lower than CCA. The supposed reason may be that ActivityNet is more challenging since it covers a wide range of videos, not limited to daily indoor activities videos in Charades-STA and cooking videos in TACoS. In such cases, label distillation may not effectively capture the key features of the dataset, resulting in limited performance gains. Besides, the effectiveness of label distillation often depends on the performance of the teacher model. The backbone we use is SeqPAN, which has relatively poor performance on ActivityNet, limiting our upper bound even with label distillation. However, our framework can be adapted to any VMR model. If we replace it with the more powerful upcoming backbone models, the performance will surpass the current version.

MethodShared EncoderLabel Distillation R1@0.3 R1@0.5 R1@0.7 mIoUEMTM w/o SE-LD×× $70.19_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.99%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.97}}$ $56.23_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-1.01%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.62}}$ $36.49_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.74%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.39}}$ $51.34_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-1.06%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.98}}$ EMTM w/o SE×✓ $\textbf{73.33}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{%0,0,1}-1.34}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.84}}$ $\textbf{58.05}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{%0,0,1}-0.25}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.26}}$ $\underline{38.36}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}%{0,0,1}-0.21}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.17}}$ $\textbf{53.31}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{%0,0,1}-0.91}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.54}}$ EMTM w/o LD✓× $72.62_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.52%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.69}}$ $56.51_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.84%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+1.18}}$ $37.54_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.50%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.85}}$ $52.39_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.37%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.60}}$ EMTM✓✓ $\underline{72.70}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}%{0,0,1}-0.55}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.47}}$ $\underline{57.91}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}%{0,0,1}-0.65}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.75}}$ $\textbf{39.80}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{%0,0,1}-0.12}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.12}}$ $\underline{53.00}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}%{0,0,1}-0.33}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.21}}$

MethodShared EncoderLabel Distillation R1@0.3 R1@0.5 R1@0.7 mIoUEMTM w/o SE-LD×× $62.06_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.85%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.99}}$ $43.90_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.39%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.25}}$ $25.63_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.13%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.07}}$ $44.52_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.38%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.55}}$ EMTM w/o SE×✓ $\underline{63.19}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}%{0,0,1}-0.22}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.35}}$ $44.11_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.27%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.26}}$ $25.74_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.32%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.41}}$ $45.15_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.03%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.03}}$ EMTM w/o LD✓× $62.98_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}-0.40%}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}+0.33}}$ $\textbf{44.68}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{%0,0,1}-0.15}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.19}}$ $\textbf{26.10}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{%0,0,1}-0.06}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.12}}$ $\underline{45.22}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}%{0,0,1}-0.12}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.11}}$ EMTM✓✓ $\textbf{63.20}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{%0,0,1}-0.58}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.30}}$ $\underline{44.73}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}%{0,0,1}-0.33}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.58}}$ $\underline{26.08}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}%{0,0,1}-0.31}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.27}}$ $\textbf{45.33}_{{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{%0,0,1}-0.31}}^{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{%1,0,0}+0.19}}$

Method R1@0.3 R1@0.5 R1@0.7 mIoU $\text{EMTM}_{w/oKAU}$ 72.4757.6338.5852.18 $\text{EMTM}_{3,3,3}$ 71.0856.9937.8252.06EMTM72.7057.9139.8053.00

4.5 Ablation Studies

In this part, we perform ablation studies to analyze the effectiveness of the EMTM. All experiments are performed three times with different random seeds.

4.5.1 Effects of Components

In our proposed framework, we design the sharing encoders (SE) to learn shallow knowledge from the teacher by label distillation(LD). To better reflect the effects of these two main components, we measure the performance of different combinations.As table 4 and 5 show, each interaction component has a positive effect on the TSGV task. On Charades-STA, the full model outperforms w/o SE by gains of 1.44% on metrics “R@1, IoU=0.7” and exceeds “w/o LD” byon the all metrics. Besides, the full model also outperforms “w/o SE-LD” by a large margin on all metrics. Similarly, our full model has made significant improvements in all metrics compared with the variant "EMTM w/o SE-LD" on ActivityNet.

Additionally, we conduct two ablation experiments of KAU for analysis in table 6. As shown, KAU with the kernel size 3, 5, 7 instead of 3, 3, 3 does work.

4.5.2 Effect of Number of Teacher Models

We investigate the influence of different numbers of teacher models on Charades-STA. As shown in Figure 4, the performance presents a rising tendency with the increase of teachers.According to the results, we realize that our improvements are not only from soft targets with one single teacher, but also from the learning of structural knowledge and intermediate-level knowledge with fused multi-teacher teaching. Multiple teachers make knowledge distillation more flexible, while ensemble helps improve the training of students and transfer related information of examples to them.

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (5)

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (6)

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (7)

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (8)

4.5.3 Effect of Different Degree of Lightweight Models

We evaluate the influence of different degrees of lightweight models by adjusting their hidden dimension $d$ on Charades-STA. As shown in Figure 5, obviously as $d$ decreases, the FLOPs and model parameter size will decline, which would also reduce the performance. From 128 to 64 for $d$ , both R1@0.7 and mIoU reduce by about 5%, while FLOPs and model parameter size drop by a small margin. For the trade-offs, we select 128 as the hidden dimension.

4.6 Qualitative Analysis

Two samples of prediction on Charades-STA are depicted in Figure 6.The first sample indicates our approach can refine the predictions when the basic model already obtained satisfactory results.The second sample shows the basic model tends to predict the boundary position, possibly due to its limited understanding of the video.As a result, the model relies on biased positional information to make moment predictions.However, utilizing a shared encoder and label distillation approach can provide additional information that enables the model to more precisely predict the moment boundary.

5 Conclusion

In this paper, we focus on the efficiency of the model on TSVG and try to expand the efficiency interval to cover the entire model.A knowledge distillation framework (EMTM) is proposed, which utilizes label distillation from multiple teachers and a shared encoder strategy.In the future, we will pay attention to video feature extraction in TSGV, which is also a time-consume process.We will propose an end-to-end model that input video frames.

References

CabaHeilbron etal. [2015]Fabian CabaHeilbron, Victor Escorcia, Bernard Ghanem, and Juan CarlosNiebles.Activitynet: A large-scale video benchmark for human activity understanding.In CVPR, pages 961–970, 2015.
Carreira and Zisserman [2017]Joao Carreira and Andrew Zisserman.Quo vadis, action recognition? a new model and the kinetics dataset.In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
Dong and Yin [2022]Jianxiang Dong and Zhaozheng Yin.Boundary-aware temporal sentence grounding with adaptive proposal refinement.In Proceedings of the Asian Conference on Computer Vision, pages 3943–3959, 2022.
f*ckuda etal. [2017]Takashi f*ckuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran.Efficient knowledge distillation from an ensemble of teachers.In Interspeech, pages 3697–3701, 2017.
Gao and Xu [2021]Junyu Gao and Changsheng Xu.Fast video moment retrieval.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1523–1532, 2021.
Gao etal. [2017]Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia.Tall: Temporal activity localization via language query.In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017.
Hinton etal. [2015]Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015.
Huang etal. [2022]Jiabo Huang, Hailin Jin, Shaogang Gong, and Yang Liu.Video activity localisation with uncertainties in temporal boundary.In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIV, pages 724–740. Springer, 2022.
Li etal. [2021]Zheng Li, Jingwen Ye, Mingli Song, Ying Huang, and Zhigeng Pan.Online knowledge distillation for efficient pose estimation.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11740–11750, 2021.
Liu etal. [2022]Jihao Liu, Boxiao Liu, Hongsheng Li, and Yu Liu.Meta knowledge distillation.arXiv preprint arXiv:2202.07940, 2022.
Pennington etal. [2014]Jeffrey Pennington, Richard Socher, and ChristopherD Manning.Glove: Global vectors for word representation.In EMNLP, pages 1532–1543, 2014.
Regneri etal. [2013]Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal.Grounding action descriptions in videos.ACL, 1:25–36, 2013.
Romero etal. [2014]Adriana Romero, Nicolas Ballas, SamiraEbrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio.Fitnets: Hints for thin deep nets.arXiv preprint arXiv:1412.6550, 2014.
Sigurdsson etal. [2016]GunnarA Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta.Hollywood in homes: Crowdsourcing data collection for activity understanding.In ECCV, pages 510–526, 2016.
Wang and Yoon [2021]Lin Wang and Kuk-Jin Yoon.Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
Wu etal. [2022]Ziyue Wu, Junyu Gao, Shucheng Huang, and Changsheng Xu.Learning commonsense-aware moment-text alignment for fast video temporal grounding.arXiv preprint arXiv:2204.01450, 2022.
Yang and Wu [2022]Shuo Yang and Xinxiao Wu.Entity-aware and motion-aware transformers for language-driven action localization.In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, LD Raedt, Ed, pages 1552–1558, 2022.
You etal. [2017]Shan You, Chang Xu, Chao Xu, and Dacheng Tao.Learning from multiple teacher networks.In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 1285–1294, New York, NY, USA, 2017. Association for Computing Machinery.
Yu etal. [2018]AdamsWei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen.Fast and accurate reading comprehension by combining self-attention and convolution.In International conference on learning representations, 2018.
Yuan etal. [2019]Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu.Semantic conditioned dynamic modulation for temporal sentence grounding in videos.Advances in Neural Information Processing Systems, 32, 2019.
Zeiler and Fergus [2014]MatthewD Zeiler and Rob Fergus.Visualizing and understanding convolutional networks.In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pages 818–833. Springer, 2014.
Zhang etal. [2020a]Hao Zhang, Aixin Sun, Wei Jing, and JoeyTianyi Zhou.Span-based localizing network for natural language video localization.arXiv preprint arXiv:2004.13931, 2020a.
Zhang etal. [2021]Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, JoeyTianyi Zhou, and Rick SiowMong Goh.Parallel attention network with sequence matching for video grounding.arXiv preprint arXiv:2105.08481, 2021.
Zhang etal. [2023]H. Zhang, A. Sun, W. Jing, and J. Zhou.Temporal sentence grounding in videos: A survey and future directions.IEEE Transactions on Pattern Analysis & Machine Intelligence, (01):1–20, 2023.
Zhang etal. [2020b]Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo.Learning 2d temporal adjacent networks for moment localization with natural language.In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12870–12877, 2020b.
Zheng etal. [2022]Minghang Zheng, Yanjie Huang, Qingchao Chen, Yuxin Peng, and Yang Liu.Weakly supervised temporal sentence grounding with gaussian-based contrastive proposal learning.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15555–15564, 2022.