Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation (2024)

Renjie Liang
National University of Singapore
liangrj5@gmail.com
  Yiming Yang
National University of Singapore
e0920761@u.nus.edu
  Hui Lu
Nanyang Technological University
hui007@ntu.edu.sg
  Li Li
National University of Singapore
lili02@u.nus.edu

Abstract

Temporal Sentence Grounding in Videos (TSGV) aims to retrieve, from an untrimmed video, the event timestamps described by a natural language query. This paper addresses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches design elaborate, complex architectures to improve accuracy, and therefore suffer from inefficiency and heavy model size. Previous attempts to address this issue have concentrated primarily on the feature fusion layers. To tackle the problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from multiple networks. Specifically, we first unify the different outputs of the different models. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers. Additionally, we propose a Shared Encoder strategy to enhance the learning of shallow layers. Extensive experimental results on three popular TSGV benchmarks demonstrate that our method is both effective and efficient. Our code is available at https://github.com/renjie-liang/EMET.

1 Introduction

[Figure 1]

Temporal Sentence Grounding in Videos (TSGV), which aims to ground a temporal segment in an untrimmed video with a natural language query, has drawn widespread attention over the past few years [24]. There is a clear trend that top-performing models are becoming larger, with ever more parameters. Moreover, recent work shows that accuracy on TSGV tasks has reached a plateau, while combinations of complex networks and multiple structures are becoming more prevalent as a way to further improve model ability, which inflates model size. However, the heavy resource cost required by these approaches restricts their applications.

In order to improve efficiency, FMVR [5] and CCA [16] construct fast TSGV models by reducing the fusion time. However, they only reduce the fusion time significantly; the whole network remains time-consuming, as depicted in Figure 1(a). Another shortcoming is that FMVR and CCA must store the features produced after the encoder, which requires extra storage.

To tackle this challenge, we extend the accelerated span to the entire network, as shown in Figure 1(a). A natural approach is to reduce the complexity of the network, e.g., by decreasing the hidden dimension, reducing the number of layers, or eliminating auxiliary losses. Nevertheless, all of these changes degrade performance to some extent. Knowledge distillation [7] is a promising technique to mitigate this performance drop and maintain high accuracy while lightening the network.

Knowledge distillation initially employed a single teacher, but as the technique matured, multiple teachers were shown to impart richer knowledge [4], as extensively corroborated in other domains [15]. A multi-teacher strategy implies a more diverse range of dark knowledge to learn from, making it more likely that the optimal knowledge is present [18]. Thus far, multi-teacher knowledge distillation has not been studied or exploited for the TSGV task.

An immediate problem is that different models produce heterogeneous outputs, e.g., candidate spans for proposal-based methods or probability distributions for proposal-free methods, so a question is how to identify the optimal knowledge from multiple teachers. In addition, knowledge carried by soft labels in the last layers is hardly backpropagated to the front layers [13], meaning that the front part of the student model rarely benefits from the teachers' knowledge. In summary, three issues need to be addressed: i) how to unify knowledge from different models, ii) how to select the optimal knowledge and assign weights among the teachers, and iii) how the front layers of the student can benefit from the teachers.

In short, two main challenges arise. Firstly, different models may yield heterogeneous outputs, such as varying candidate spans or probability distributions. Secondly, it is difficult to backpropagate knowledge from the last layers to the front layers, which limits how much the student's shallow layers can learn from the teacher models [13].

For the first challenge, we unify the heterogeneous model outputs by converting them into 1D probability distributions. This enables us to seamlessly integrate the knowledge during model training. The 1D distribution, adopted from the span-based methods of the proposal-free category, offers a speed advantage over proposal-based methods.

To integrate knowledge from various models, we have developed a Knowledge Aggregation Unit (KAU). This unit utilizes multi-scale information [9] to derive a higher-quality target distribution, moving beyond mere averaging of probabilities. The KAU adaptively assigns weights to different teacher models. This approach overcomes the limitations of manually tuning weights, a sensitive hyperparameter in multi-teacher distillation [10].

Regarding the second challenge, we implemented a shared layer strategy to facilitate the transfer of shallow knowledge from the teacher to the student model. This involves co-training a teacher model with our student model, sharing encoder layers and aligning hidden states. Such an arrangement ensures comprehensive and global knowledge acquisition by the student model.

During inference, we only run the student model, which adds no computational overhead. To sum up, the primary contributions of this paper are threefold:

  • We propose a multi-teacher knowledge distillation framework for the TSGV task. This approach substantially reduces the time consumed and significantly decreases the number of parameters, while still maintaining high levels of accuracy.

  • To enable the whole student to benefit from various teacher models, we unify the knowledge from different models and use the KAU module to adaptively integrate it into a single soft label. Additionally, a shared encoder strategy is utilized to transfer knowledge from the teacher model to the front layers.

  • Extensive experimental results on three popular TSGV benchmarks demonstrate that our proposed method performs favorably against state-of-the-art methods while achieving the highest speed with the fewest parameters and the least computation.

2 Related Work

Given an untrimmed video, temporal sentence grounding in videos (TSGV), also known as Video Moment Retrieval (VMR), aims to retrieve a video segment according to a query. Existing solutions are roughly categorized into proposal-based and proposal-free frameworks. We also introduce some works on fast video temporal grounding below.

2.1 Proposal-based Methods

The majority of proposal-based approaches rely on carefully designed dense sampling strategies, which gather a set of video segments as candidate proposals and rank them by their similarity scores to the query to choose the most compatible pairs. Zhang et al. [25] convert visual features into a 2D temporal map and encode the query in a sentence-level representation; this is the first solution to model proposals with a 2D temporal map (2D-TAN). BAN-APR [3] utilizes a boundary-aware feature enhancement module to enhance the proposal feature with its boundary information by imposing a new temporal difference loss. Currently, most proposal-based methods are time-consuming due to the large number of proposal-query interactions.

2.2 Proposal-free Methods

In fact, the impressive performance of proposal-based methods depends heavily on the caliber of the sampled proposals. To avoid the additional computational costs associated with producing proposal features, proposal-free approaches directly regress or predict the start and end times of the target moment. VSLNet [22] exploits context-query attention modified from QANet [19] to perform fine-grained multimodal interaction, after which a conditioned span predictor computes the probabilities of the start/end boundaries of the target moment. SeqPAN [23] designs a self-guided parallel attention module, inspired by sequence labeling tasks in natural language processing, to effectively capture self-modal contexts and cross-modal attentive information between video and text. Yang and Wu [17] propose Entity-Aware and Motion-Aware Transformers (EAMAT) that progressively localize actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries in a shrunken temporal region with motion queries. Nevertheless, as performance improves, the huge and complex architectures inevitably incur higher computational cost during the inference phase.

2.3 Fast Video Temporal Grounding

Recently, fast video temporal grounding has been proposed for more practical applications. According to [5], the standard TSGV pipeline can be divided into three components; the visual and text encoders are shown to have little influence on test time because features are pre-extracted and stored before testing, so cross-modal interaction is the key to reducing the test time. Thus, a fine-grained semantic distillation framework is utilized to leverage semantic information for improving performance. Besides, Wu et al. [16] utilize commonsense knowledge to obtain bridged visual and text representations, promoting each other in common space learning. However, based on our earlier analysis, the inference time reported by [5] covers only part of the entire prediction process; the full pipeline from input video features to predicted timestamps is still time-consuming.

[Figure 2]

3 Methodology

In this section, we first give a brief task definition of TSGV in Section 3.1. Heterogeneous knowledge unification is then presented as a prerequisite in Section 3.2.1. Next, we introduce the student network (Section 3.2.2), the teacher network (Section 3.2.3), the knowledge aggregation unit (Section 3.2.4), and the shared encoder strategy (Section 3.2.5), as shown in Figure 2.

Finally, the training and inference processes, as well as the loss settings, are presented in Section 3.3.

3.1 Problem Formulation

Given an untrimmed video $V=[f_t]_{t=1}^{T}$ and a language query $Q=[q_j]_{j=1}^{m}$, where $T$ and $m$ are the numbers of frames and words, respectively, the start and end times of the ground-truth moment are denoted by $\tau_s$ and $\tau_e$, with $1 \leq \tau_s < \tau_e \leq T$. Mathematically, TSGV retrieves the target moment starting at $\tau_s$ and ending at $\tau_e$ given the video $V$ and query $Q$, i.e., $\mathcal{F}_{TSGV}:(V,Q)\mapsto(\tau_s,\tau_e)$.

3.2 General Scheme

3.2.1 Heterogeneous Knowledge Unification

Compared to proposal-based methods, span-based methods do not need to generate redundant proposals, which is an inherent efficiency advantage. Meanwhile, a 1D distribution carries more knowledge than the output of a regression-based method. Hence we unify the various heterogeneous outputs into 1D probability distributions and develop our network based on the span-based method, as shown in Figure 2. The outputs of the span-based method are the 1D probability distributions of the start and end moments, denoted as $P_s, P_e \in \mathbb{R}^{n}$. To keep notation concise, we use $P \in \mathbb{R}^{2n}$ without subscripts to express the stacked probabilities for the start and end moments.

We simply apply the softmax function to the outputs of the span-based methods to obtain the probability distributions:

$$P_s = \mathrm{Softmax}(P'_s), \quad P_e = \mathrm{Softmax}(P'_e) \tag{1}$$

The 2D-map anchor-based method is a common branch of the proposal-based methods, e.g., [25], [3]. A 2D map $S=[s_{i,j}]\in\mathbb{R}^{n\times n}$ is generated to model temporal relations between proposal candidates, in which one dimension indicates the start moment and the other indicates the end moment. We take the maximum score of $S$ along each row/column as the start/end distribution:

$$P_s = \mathrm{Softmax}\big(\max_{j} s_{i,j}\big), \quad P_e = \mathrm{Softmax}\big(\max_{i} s_{i,j}\big) \tag{2}$$

As for the regression-based method, we obtain a time pair $(t_s, t_e)$ after computation. A Gaussian distribution is then leveraged to simulate the probability distributions of the start/end moments as follows:

$$P_s = \mathrm{Softmax}\big(N(t_s, \sigma^2)\big), \quad P_e = \mathrm{Softmax}\big(N(t_e, \sigma^2)\big) \tag{3}$$

The proposal-generation method produces a list of triples $S'=\{(t_s^i, t_e^i, r^i)\}\in\mathbb{R}^{3\times k}$, where $k$ is the number of proposal candidates. Similarly, we use a Gaussian distribution to generate the probability distribution of the start/end moment for each candidate, then weight the candidates by their scores and accumulate them:

$$P_s = \mathrm{Softmax}\Big(\sum_{i} r^i\, N(t_s^i, \sigma^2)\Big), \quad P_e = \mathrm{Softmax}\Big(\sum_{i} r^i\, N(t_e^i, \sigma^2)\Big) \tag{4}$$

where $\sigma^2$ is the variance of the Gaussian distribution $N$.
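To make the unification concrete, the following minimal PyTorch sketch converts each type of teacher output into start/end distributions following Eqs. (1)-(4). The tensor shapes and helper names here are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of heterogeneous knowledge unification (Eqs. 1-4).
import torch
import torch.nn.functional as F


def unify_span_based(start_logits, end_logits):
    """Span-based teacher: start/end logits of shape (n,) -> probabilities via softmax (Eq. 1)."""
    return F.softmax(start_logits, dim=-1), F.softmax(end_logits, dim=-1)


def unify_2d_map(score_map):
    """2D-map teacher: score_map of shape (n, n) -> row/column max, then softmax (Eq. 2)."""
    p_s = F.softmax(score_map.max(dim=1).values, dim=-1)  # max over the end index j
    p_e = F.softmax(score_map.max(dim=0).values, dim=-1)  # max over the start index i
    return p_s, p_e


def gaussian_over_positions(center, n, sigma):
    """Discrete Gaussian bump N(center, sigma^2) evaluated at n temporal positions."""
    pos = torch.arange(n, dtype=torch.float32)
    return torch.exp(-0.5 * ((pos - center) / sigma) ** 2)


def unify_regression(t_s, t_e, n, sigma=1.0):
    """Regression teacher: a single (t_s, t_e) pair -> Gaussian-shaped distributions (Eq. 3)."""
    p_s = F.softmax(gaussian_over_positions(t_s, n, sigma), dim=-1)
    p_e = F.softmax(gaussian_over_positions(t_e, n, sigma), dim=-1)
    return p_s, p_e


def unify_proposals(starts, ends, scores, n, sigma=1.0):
    """Proposal teacher: k candidates (t_s^i, t_e^i, r^i) -> score-weighted Gaussian mixture (Eq. 4)."""
    p_s = sum(r * gaussian_over_positions(s, n, sigma) for s, r in zip(starts, scores))
    p_e = sum(r * gaussian_over_positions(e, n, sigma) for e, r in zip(ends, scores))
    return F.softmax(p_s, dim=-1), F.softmax(p_e, dim=-1)
```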

3.2.2 Student Network

The design of the student network emphasizes efficient processing. The video feature, represented as $\bm{V}\in\mathbb{R}^{n\times d_v}$, is extracted with I3D, as proposed by Carreira and Zisserman [2] and adopted by Zhang et al. [23]. In parallel, the query feature, denoted by $\bm{Q}\in\mathbb{R}^{m\times d_q}$, is initialized with GloVe embeddings, where $n$ and $m$ indicate the lengths of the video and query features. Both the video and query features are projected and encoded to align their dimensions, facilitating uniformity and interoperability in subsequent computational stages.

$$\bm{V'} = \mathtt{VisualEncoder}(\bm{V}), \quad \bm{Q'} = \mathtt{QueryEncoder}(\bm{Q}) \tag{5}$$

Subsequently, a lightweight transformer is employed to fuse the video and query features. This approach, in contrast to the direct dot multiplication method referenced in Gao et al. [5], significantly enhances performance without imposing excessive computational load on the model.

$$\bm{V}^{qv} = \mathtt{Transformer}(\bm{V'}, \bm{Q'}) \tag{6}$$

Finally, a predictor generates the logits of the start and end points. $\bm{P_s}$ and $\bm{P_e}$ are then multiplied to form a matrix, within which the highest value is identified; the row and column indices of this peak value correspond to the predicted start and end indices, respectively.

$$(\bm{P_s}, \bm{P_e}) = \mathtt{Predictor}(\bm{V}^{qv}) \in \mathbb{R}^{n} \tag{7}$$
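A minimal sketch of the student pipeline in Eqs. (5)-(7) is given below, assuming conv1d encoders and a single lightweight transformer fusion layer; all layer sizes and module choices here are illustrative assumptions rather than the exact released architecture.

```python
import torch
import torch.nn as nn


class StudentNet(nn.Module):
    """Illustrative sketch of the student in Eqs. (5)-(7); layer sizes are assumptions."""

    def __init__(self, d_video=1024, d_query=300, d_model=128, n_heads=8):
        super().__init__()
        # Lightweight conv1d projectors/encoders (shared with the teacher in EMTM).
        self.visual_encoder = nn.Conv1d(d_video, d_model, kernel_size=7, padding=3)
        self.query_encoder = nn.Conv1d(d_query, d_model, kernel_size=1)
        # One lightweight transformer layer fuses query information into video features (Eq. 6).
        self.fusion = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=256, batch_first=True)
        # Predictor produces start/end logits over the n video positions (Eq. 7).
        self.start_head = nn.Linear(d_model, 1)
        self.end_head = nn.Linear(d_model, 1)

    def forward(self, video, query):
        # video: (B, n, d_video), query: (B, m, d_query)
        v = self.visual_encoder(video.transpose(1, 2)).transpose(1, 2)  # (B, n, d_model), Eq. (5)
        q = self.query_encoder(query.transpose(1, 2)).transpose(1, 2)   # (B, m, d_model), Eq. (5)
        v_qv = self.fusion(tgt=v, memory=q)                             # (B, n, d_model), Eq. (6)
        p_s = self.start_head(v_qv).squeeze(-1)                         # (B, n) start logits
        p_e = self.end_head(v_qv).squeeze(-1)                           # (B, n) end logits
        return p_s, p_e
```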

3.2.3 Teacher Network

The teacher network is architected with a focus on performance, in contrast to the student network, which is streamlined for efficiency. The specific differences between the two networks are itemized in Table 1. Cross-modal fusion layers are incorporated to augment the interaction between modalities within the teacher model. Following the encoder stage, a group of transformer layers is employed to fuse the query and video features, and these cross-modal features are then fed into the fusion layer.

$$\bm{V'} = \mathtt{Transformer}(\bm{V'}, \bm{Q'}), \quad \bm{Q'} = \mathtt{Transformer}(\bm{Q'}, \bm{V'}) \tag{8}$$
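Under the same assumptions as the student sketch above, the bidirectional fusion of Eq. (8) can be read as two cross-attention passes, one per direction; this is an illustrative reading, not the authors' exact implementation.

```python
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Sketch of the teacher's bidirectional fusion (Eq. 8): video attends to query and vice versa."""

    def __init__(self, d_model=128, n_heads=8):
        super().__init__()
        self.v2q = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=256, batch_first=True)
        self.q2v = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=256, batch_first=True)

    def forward(self, v, q):
        v_new = self.v2q(tgt=v, memory=q)  # V' <- Transformer(V', Q')
        q_new = self.q2v(tgt=q, memory=v)  # Q' <- Transformer(Q', V')
        return v_new, q_new
```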

Table 1: Layer configurations of the teacher and student networks.

| Module | Layer | Teacher | Student |
|---|---|---|---|
| Projector | CONV1D_query | 1 | 1 |
| Projector | CONV1D_video | 4 | 4 |
| Fusion Layer | Transformer_query | 4 | 0 |
| Fusion Layer | Transformer_video | 4 | 0 |
| Fusion Layer | Transformer_fusion | 1 | 1 |
| Predictor | CONV1D_start | 6 | 3 |
| Predictor | CONV1D_end | 6 | 3 |

Table 2: Efficiency comparison. For each dataset, the four columns are FLOPs (B), Params (M), Time (ms), and sumACC; the three column groups correspond to Charades-STA, ActivityNet, and TACoS, respectively.

| Method | Year | FLOPs (B) | Params (M) | Time (ms) | sumACC | FLOPs (B) | Params (M) | Time (ms) | sumACC | FLOPs (B) | Params (M) | Time (ms) | sumACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SCDM | 2019 | 16.5000 | 12.8800 | - | 87.87 | 260.2300 | 15.6500 | - | 56.61 | 260.2300 | 15.6500 | - | - |
| 2D-TAN | 2020 | 52.2616 | 69.0606 | 13.3425 | 66.05 | 1067.9000 | 82.4400 | 77.9903 | 71.43 | 1067.9000 | 82.4400 | 77.9903* | 36.82 |
| VSLNet | 2020 | 0.0300 | 0.7828 | 8.0020 | 77.50 | 0.0521 | 0.8005 | 8.9893 | 69.38 | 0.0630 | 0.8005 | 8.9893 | 44.30 |
| SeqPAN | 2021 | 0.0209 | 1.1863 | 10.5168 | 102.20 | 0.0214 | 1.2143 | 13.7138 | 73.87 | 0.0218 | 1.2359 | 23.3025 | 67.71 |
| EMB | 2022 | 0.0885 | 2.2168 | 22.3900 | 97.58 | 0.2033 | 6.1515 | 25.0871 | 70.88 | 0.2817 | 2.2172 | 23.6349 | 60.36 |
| EAMAT | 2022 | 1.2881 | 94.1215 | 56.1753 | 103.65 | 4.1545 | 93.0637 | 125.7822 | 60.94 | 4.1545 | 93.0637 | 125.7822 | 64.98 |
| BAN-APR | 2022 | 9.4527 | 34.6491 | 19.9767 | 105.96 | 25.4688 | 45.6714 | 44.8587 | 77.79 | 25.4688 | 45.6714 | 44.8587* | 52.10 |
| CPL | 2022 | 3.4444 | 5.3757 | 26.8451 | 71.63 | 3.8929 | 7.0115 | 26.4423 | 49.14 | - | - | - | - |
| CNM | 2022 | 0.5260 | 5.3711 | 5.4482 | 50.10 | 0.5063 | 7.0074 | 4.8629* | 48.96 | - | - | - | - |
| FVMR | 2021 | - | - | - | 88.75 | - | - | - | 71.85 | - | - | - | - |
| CCA | 2022 | 137.2984 | 79.7671 | 26.9734 | 89.41 | 151.1023 | 22.5709 | 31.5400* | 75.95 | 151.1023 | 22.5709 | 31.5400 | 50.90 |
| EMTM (Ours) | - | 0.0081 | 0.6569 | 4.7998 | 92.80 | 0.0084 | 0.6848 | 3.5431 | 70.91 | 0.0087 | 0.7065 | 4.5737 | 58.24 |

3.2.4 Knowledge Aggregation Unit

Our goal is to combine all the unified predictions from $b$ branches to establish a strong teacher distribution. Drawing inspiration from [9], we develop the Knowledge Aggregation Unit (KAU). The KAU integrates parallel transformations with varying receptive fields, harnessing both local and global contexts to derive a more accurate target probability. The KAU's design is illustrated in Figure 3.

[Figure 3]

To preserve more of the original information, we first take the video features $V'$ from Eq. (5) as input and apply convolution layers, starting with a small kernel size of 3 and increasing to 5 and 7. Further, we incorporate the average pooling of the query features $Q'$ from Eq. (5) for richer representations. We then concatenate all the splits to obtain the intermediate vector $v$:

$$v = \big[\, q_{avg},\; g([v_{conv3}, v_{conv5}, v_{conv7}]) \,\big] \tag{9}$$

where $q_{avg}$ denotes the query features $Q'$ after average pooling, $g(\cdot)$ denotes the global pooling function, and $v_{conv3}$, $v_{conv5}$, and $v_{conv7}$ denote the outputs of the convolution layers with kernel sizes 3, 5, and 7, respectively.

After passing through a fully connected layer $\mathtt{FC}$, a channel-wise softmax operator is applied to obtain the soft attention $a$:

$$a = \mathrm{Softmax}(\mathtt{FC}(v)) \in \mathbb{R}^{2b\times n} \tag{10}$$

where $b$ denotes the number of teacher branches; the factor $2b$ arises because there are two probability distributions (i.e., start and end) per branch.

Finally, we fuse prediction results from multiple branches via an element-wise summation to obtain the weighted ensemble probability.

$$\widetilde{P} = \sum_{i=1}^{b} a^{i} \otimes \hat{P}^{i} \in \mathbb{R}^{2\times n} \tag{11}$$

where $\widetilde{P}$ denotes the ensemble probability, $\hat{P}^{i}\in\mathbb{R}^{2\times n}$ is the start and end distribution from the $i$-th teacher branch, and $\otimes$ refers to channel-wise multiplication. Our experiments (see Section 4.5.1) show that the weights generated by the KAU achieve better distillation performance.
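A minimal sketch of the KAU (Eqs. 9-11) follows, under assumed tensor shapes: multi-scale conv branches and a pooled query feature produce per-branch weights over the $b$ unified teacher distributions. The fixed n_positions argument and the layer sizes are our own assumptions, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeAggregationUnit(nn.Module):
    """Illustrative sketch of the KAU (Eqs. 9-11); layer shapes are assumptions."""

    def __init__(self, d_model=128, n_positions=64, n_branches=3):
        super().__init__()
        # Parallel convolutions with kernel sizes 3, 5, 7 capture multi-scale context.
        self.convs = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size=k, padding=k // 2) for k in (3, 5, 7)
        )
        self.fc = nn.Linear(4 * d_model, 2 * n_branches * n_positions)
        self.n_positions, self.n_branches = n_positions, n_branches

    def forward(self, v_feat, q_feat, teacher_probs):
        # v_feat: (B, n, d) video features V'; q_feat: (B, m, d) query features Q';
        # teacher_probs: (B, b, 2, n) unified start/end distributions from the b teachers.
        v = v_feat.transpose(1, 2)                                   # (B, d, n)
        multi = torch.cat([conv(v) for conv in self.convs], dim=1)   # (B, 3d, n)
        g = multi.mean(dim=-1)                                       # global pooling g(.): (B, 3d)
        q_avg = q_feat.mean(dim=1)                                   # average-pooled query: (B, d)
        v_vec = torch.cat([q_avg, g], dim=-1)                        # Eq. (9): (B, 4d)
        a = self.fc(v_vec).view(-1, self.n_branches, 2, self.n_positions)
        a = F.softmax(a, dim=1)                                      # Eq. (10): weights over teacher branches
        return (a * teacher_probs).sum(dim=1)                        # Eq. (11): ensemble distribution (B, 2, n)
```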

3.2.5 Shared Encoder Strategy

The backpropagation of knowledge from soft labels often provides limited benefit to the shallow layers, primarily due to the influence of non-linear activation functions and dropout mechanisms. However, the feature invariance of shallow layers observed by Zeiler and Fergus [21] guides our approach: we share several shallow layers between the student and teacher networks. This collaborative training strategy enables the shallow layers of the student network to assimilate additional knowledge from the teacher network, enhancing their learning capacity.

Specifically, the student and a teacher share their visual and query encoders, as shown in Figure 2. Each encoder consists of several conv1D layers, which are inherently lightweight and fast. The $\mathtt{VisualEncoder}$ and $\mathtt{QueryEncoder}$ in Eq. (5) denote the shared layers in our network.
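One simple way to realize this strategy is to instantiate the shallow encoders once and let both networks hold references to the same modules, so that the teacher's gradients also update them. The sketch below only illustrates this sharing mechanism with placeholder module sizes.

```python
import torch.nn as nn

# Illustrative sketch of the shared-encoder strategy: the student and the co-trained teacher
# reference the *same* conv1d encoder modules, so gradients from either loss update them.
visual_encoder = nn.Conv1d(1024, 128, kernel_size=7, padding=3)  # shared VisualEncoder (placeholder sizes)
query_encoder = nn.Conv1d(300, 128, kernel_size=1)               # shared QueryEncoder (placeholder sizes)

student = nn.ModuleDict({"visual_encoder": visual_encoder, "query_encoder": query_encoder})
teacher = nn.ModuleDict({"visual_encoder": visual_encoder, "query_encoder": query_encoder})

# Same parameter objects, not copies: updating one updates the other.
assert student["visual_encoder"] is teacher["visual_encoder"]
```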

3.3 Training and Inference

3.3.1 TSGV Loss

The overall training loss of our model is described as follows. For both the student and the teacher, a hard loss (i.e., label loss) is used to optimize the distributions of the start/end boundaries:

$$L^{st}_{loc} = f_{CE}(P^{st}, Y), \quad L^{tc}_{loc} = f_{CE}(P^{tc}, Y) \tag{12}$$

where $f_{CE}$ is the cross-entropy function and $Y$ contains the one-hot labels of the ground-truth start and end boundaries. Similarly, we encourage the ensemble probability to get closer to the ground-truth distribution:

$$L^{ens}_{loc} = f_{CE}(\widetilde{P}, Y) \tag{13}$$

As discussed previously, the learned ensemble information serves as complementary cues that provide an enhanced supervisory signal to our student model. We therefore introduce multi-teacher distillation learning, which transfers the rich knowledge in the form of softened labels. The formulation is given by:

$$L_{dis} = f_{KL}\big(\mathrm{softmax}(P^{st}, t),\; \mathrm{softmax}(\widetilde{P}, t)\big) \tag{14}$$

where $f_{KL}$ represents the KL divergence and $t$ is the temperature in knowledge distillation, which controls the smoothness of the output distribution.

Based on the above design, the overall objective for a training video-query pair is formulated as:

$$L = L^{st}_{loc} + L^{tc}_{loc} + L^{ens}_{loc} + \alpha L_{dis} \tag{15}$$

where $\alpha$ is a balancing term.
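The overall objective of Eq. (15) can be sketched as below, with cross-entropy over the temporal positions and a temperature-scaled KL term for distillation. The tensor shapes and the exact temperature handling are our assumptions.

```python
import torch
import torch.nn.functional as F


def tsgv_loss(student_logits, teacher_logits, ensemble_probs, start_idx, end_idx, alpha=0.1, t=1.0):
    """Sketch of Eqs. (12)-(15). student_logits / teacher_logits: (B, 2, n) stacked start/end logits;
    ensemble_probs: KAU output of shape (B, 2, n); start_idx, end_idx: (B,) ground-truth indices."""
    labels = torch.stack([start_idx, end_idx], dim=1)                      # (B, 2)

    def hard_loss(logits):
        # Cross-entropy over the n positions, for start and end separately (Eq. 12).
        return sum(F.cross_entropy(logits[:, i], labels[:, i]) for i in range(2))

    l_st, l_tc = hard_loss(student_logits), hard_loss(teacher_logits)
    # Ensemble loss (Eq. 13): negative log-likelihood of the true boundaries under the ensemble.
    p_gt = ensemble_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1).clamp_min(1e-8)
    l_ens = -p_gt.log().mean()
    # Distillation loss (Eq. 14): KL between temperature-softened student and ensemble distributions.
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_ensemble = F.softmax(ensemble_probs.clamp_min(1e-8).log() / t, dim=-1)  # softened ensemble
    l_dis = F.kl_div(log_p_student, p_ensemble, reduction="batchmean")
    return l_st + l_tc + l_ens + alpha * l_dis                             # Eq. (15)
```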

3.3.2 Inference

The teacher and student models are trained collaboratively, while only the student model is used for TSGV at test time. The learned rich information serves as complementary cues that provide an enhanced supervisory signal to the TSGV model. Unlike FMVR [5] and CCA [16], we do not need to pre-compute and store intermediate visual features.
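At test time, the predicted span can be read off the product of the start and end distributions, restricted to valid spans whose start index does not exceed the end index. A minimal sketch under these assumptions:

```python
import torch


def select_span(p_start, p_end):
    """Pick (start, end) maximizing P_s[i] * P_e[j] subject to i <= j.
    p_start, p_end: (n,) probability vectors from the student predictor."""
    n = p_start.size(0)
    score = torch.outer(p_start, p_end)     # (n, n) joint span scores
    score = torch.triu(score)               # zero out invalid spans with end < start
    flat_idx = torch.argmax(score).item()
    return flat_idx // n, flat_idx % n      # (start index, end index)
```

The resulting indices are then mapped back to timestamps according to the video duration and the number of sampled clips.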

Table 3: Accuracy comparison. The three column groups correspond to Charades-STA, ActivityNet, and TACoS, respectively.

| Method | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SCDM | - | 54.44 | 33.43 | - | 54.80 | 36.75 | 19.86 | - | 26.11 | 21.17 | - | - |
| 2D-TAN | - | 42.80 | 23.25 | - | 58.75 | 44.05 | 27.38 | - | 35.17 | 25.17 | 11.65 | 24.16 |
| VSLNet | 64.30 | 47.31 | 30.19 | 45.15 | 63.16 | 43.22 | 26.16 | 43.19 | 29.61 | 24.27 | 20.03 | 24.11 |
| SeqPAN | 73.84 | 60.86 | 41.34 | 53.92 | 61.65 | 45.50 | 28.37 | 45.11 | 48.64 | 39.64 | 28.07 | 37.17 |
| EMB | 72.50 | 58.33 | 39.25 | 53.09 | 64.13 | 44.81 | 26.07 | 45.59 | 50.46 | 37.82 | 22.54 | 35.49 |
| EAMAT | 74.19 | 61.69 | 41.96 | 54.45 | 55.33 | 38.07 | 22.87 | 40.12 | 50.11 | 38.16 | 26.82 | 36.43 |
| BAN-APR* | 74.05 | 63.68 | 42.28 | 54.15 | 65.11 | 48.12 | 29.67 | 45.87 | 48.24 | 33.74 | 17.44 | 32.95 |
| CPL | 66.40 | 49.24 | 22.39 | 43.48 | 55.73 | 31.37 | 12.32 | 36.82 | - | - | - | - |
| CNM | 60.04 | 35.15 | 14.95 | - | 55.68 | 33.33 | 12.81 | 36.15 | - | - | - | - |
| FVMR | - | 55.01 | 33.74 | - | 60.63 | 45.00 | 26.85 | - | 41.48 | 29.12 | - | - |
| CCA | 70.46 | 54.19 | 35.22 | 50.02 | 61.99 | 46.58 | 29.37 | 45.11 | 45.30 | 32.83 | 18.07 | - |
| EMTM (Ours) | 72.70 | 57.91 | 39.80 | 53.00 | 63.20 | 44.73 | 26.08 | 45.33 | 45.78 | 34.83 | 23.41 | 34.44 |
| Δ_SOTA | ↑2.24 | ↑2.90 | ↑4.58 | ↑2.98 | ↑1.21 | ↓1.85 | ↓3.29 | ↑0.22 | ↑0.48 | ↑2.42 | ↑5.34 | - |

4 Experiments

4.1 Datasets

To evaluate the performance of TSGV, we conduct experiments on three challenging datasets; all queries in these datasets are in English. Details of the datasets are as follows:

Charades-STA [6] is composed of videos of daily indoor activities and is built on the Charades dataset [14]. It contains 6,672 videos, 16,128 annotations, and 11,767 moments, with an average video length of 30 seconds. 12,408 and 3,720 moment annotations are labeled for training and testing, respectively;

ActivityNet Captions [1] was originally constructed for dense video captioning and contains about 20k YouTube videos with an average length of 120 seconds. As a dual task of dense video captioning, TSGV uses each sentence description as a query and outputs its temporal boundary.

TACoS [12] is collected from the MPII Cooking dataset [12] and contains 127 videos with an average length of 286.59 seconds.

4.2 Evaluation Metrics

Following existing video grounding works, we evaluate the performance on two main metrics:

mIoU: "mIoU" is the average Intersection over Union between predictions and ground truth over all testing samples. The mIoU metric is particularly challenging for short video moments;

Recall: We adopt "R@$n$, IoU=$\mu$" as the evaluation metric, following [6]. It represents the percentage of language queries having at least one of the top-$n$ predictions whose IoU with the ground truth is larger than $\mu$. In our experiments, we report results for $n=1$ and $\mu\in\{0.3, 0.5, 0.7\}$.

Efficiency metrics: Time, FLOPs, and Params are used to measure the efficiency of the model. Specifically, Time refers to the entire inference time from the input of a video-query pair to the output of the prediction; FLOPs refers to floating-point operations, which measures the computational complexity of the model; Params refers to the model parameter size, excluding the word embeddings.
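For completeness, the accuracy metrics can be computed as in the sketch below, assuming top-1 predictions and ground truths are given as (start, end) pairs in seconds.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def evaluate(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return mIoU and R@1, IoU=mu (in percent) over paired predictions and ground truths."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = 100.0 * sum(ious) / len(ious)
    recall = {mu: 100.0 * sum(iou >= mu for iou in ious) / len(ious) for mu in thresholds}
    return miou, recall
```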

4.3 Implementation Details

For the language query $Q$, we use 300-D GloVe [11] vectors to initialize each lowercase word; these vectors are fixed during training. Following previous methods, 3D convolutional (I3D) features are extracted to encode videos. We set the dimension of all hidden layers to 128, the kernel size of the convolutional layers to 7, and the number of attention heads to 8. For all datasets, models are trained for 100 epochs with a batch size of 16 and a dropout rate of 0.2. Besides, an early stopping strategy is adopted to prevent overfitting. The whole framework is trained with the Adam optimizer and an initial learning rate of 0.0001. The loss weight $\alpha$ is set to 0.1 on all datasets. The temperature is set to 1, 3, and 3 on Charades-STA, ActivityNet, and TACoS, respectively. The pre-trained teacher models are selected from SeqPAN, BAN-APR, EAMAT, and CCA. More ablation studies can be found in Section 4.5. All experiments are conducted on an NVIDIA RTX A5000 GPU with 24GB memory; every experiment is run three times and we report the average performance.

4.4 Comparison with State-of-the-art Methods

We strive to gather the most recent approaches and compare our proposed model with the following state-of-the-art baselines on the three benchmark datasets:

  • Proposal-based Methods: SCDM [20], 2D-TAN [25], BAN-APR [3].

  • Proposal-free Methods: VSLNet [22], SeqPAN [23], EMB [8], EAMAT [17].

  • Weakly Supervised Methods: CPL [26], CNM [26]

  • Fast Methods: FVMR [5], CCA [16]

The best performance is highlighted in bold and the second-best is highlighted with underline in tables.

Overall Efficiency-Accuracy Analysis

Since the fast TSGV task values efficiency as much as accuracy, we evaluate FLOPs, Params, and Time for each model. For a fair comparison, the batch size is set to 1 for all methods during inference. Besides, we also report the sum of the accuracies at "R1@0.3" and "R1@0.5", named sumACC, to evaluate the overall performance of each model.

As Table 2 shows, our method surpasses all other methods, achieving the highest speed and the lowest FLOPs and Params on all three datasets. We note that EMTM requires at least 2000 times fewer FLOPs than state-of-the-art proposal-based models (SCDM and 2D-TAN). In terms of sumACC, EMTM outperforms these two models by up to 26.75% on Charades-STA and 14.30% on ActivityNet. Although the parameter size of VSLNet is at the same level as ours, we outperform it significantly in accuracy, achieving a 15.30% absolute improvement in sumACC on Charades-STA. Compared with CCA, which was proposed for fast TSGV, EMTM uses 16,950x fewer FLOPs and 121x fewer model parameters on Charades-STA. These comparisons illustrate that our method has significant efficiency and accuracy advantages.

Accuracy Analysis

As shown in Table 3, our method performs better than extensive baselines on most metrics across the three benchmark datasets. Compared with FVMR and CCA, our model performs better on all metrics. In particular, EMTM achieves absolute improvements of 4.58% on Charades-STA and 5.34% on TACoS on the metric "R@1, IoU=0.7", a more demanding criterion that requires higher-quality predictions.

The performance of our model on ActivityNet is slightly lower than CCA's. A possible reason is that ActivityNet is more challenging, since it covers a wide range of videos rather than only the daily indoor activities of Charades-STA or the cooking videos of TACoS. In such cases, label distillation may not effectively capture the key features of the dataset, resulting in limited performance gains. Besides, the effectiveness of label distillation often depends on the performance of the teacher model; the backbone we use is SeqPAN, which performs relatively poorly on ActivityNet, limiting our upper bound even with label distillation. However, our framework can be adapted to any VMR model: if we replace the backbone with a more powerful upcoming model, the performance should surpass the current version.

Table 4: Ablation study of the shared encoder (SE) and label distillation (LD) on Charades-STA. The -/+ values show the spread over repeated runs.

| Method | Shared Encoder | Label Distillation | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|---|---|---|
| EMTM w/o SE-LD | × | × | 70.19 (−0.99/+0.97) | 56.23 (−1.01/+0.62) | 36.49 (−0.74/+0.39) | 51.34 (−1.06/+0.98) |
| EMTM w/o SE | × | ✓ | 73.33 (−1.34/+0.84) | 58.05 (−0.25/+0.26) | 38.36 (−0.21/+0.17) | 53.31 (−0.91/+0.54) |
| EMTM w/o LD | ✓ | × | 72.62 (−0.52/+0.69) | 56.51 (−0.84/+1.18) | 37.54 (−0.50/+0.85) | 52.39 (−0.37/+0.60) |
| EMTM | ✓ | ✓ | 72.70 (−0.55/+0.47) | 57.91 (−0.65/+0.75) | 39.80 (−0.12/+0.12) | 53.00 (−0.33/+0.21) |

Table 5: Ablation study of the shared encoder (SE) and label distillation (LD) on ActivityNet. The -/+ values show the spread over repeated runs.

| Method | Shared Encoder | Label Distillation | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|---|---|---|
| EMTM w/o SE-LD | × | × | 62.06 (−0.85/+0.99) | 43.90 (−0.39/+0.25) | 25.63 (−0.13/+0.07) | 44.52 (−0.38/+0.55) |
| EMTM w/o SE | × | ✓ | 63.19 (−0.22/+0.35) | 44.11 (−0.27/+0.26) | 25.74 (−0.32/+0.41) | 45.15 (−0.03/+0.03) |
| EMTM w/o LD | ✓ | × | 62.98 (−0.40/+0.33) | 44.68 (−0.15/+0.19) | 26.10 (−0.06/+0.12) | 45.22 (−0.12/+0.11) |
| EMTM | ✓ | ✓ | 63.20 (−0.58/+0.30) | 44.73 (−0.33/+0.58) | 26.08 (−0.31/+0.27) | 45.33 (−0.31/+0.19) |

Table 6: Ablation study of the KAU on Charades-STA.

| Method | R1@0.3 | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|---|
| EMTM w/o KAU | 72.47 | 57.63 | 38.58 | 52.18 |
| EMTM (3,3,3) | 71.08 | 56.99 | 37.82 | 52.06 |
| EMTM | 72.70 | 57.91 | 39.80 | 53.00 |

4.5 Ablation Studies

In this part, we perform ablation studies to analyze the effectiveness of the EMTM. All experiments are performed three times with different random seeds.

4.5.1 Effects of Components

In our proposed framework, we design the shared encoder (SE) to learn shallow knowledge from the teacher and label distillation (LD) to transfer soft-label knowledge. To better reflect the effects of these two main components, we measure the performance of different combinations. As Tables 4 and 5 show, each component has a positive effect on the TSGV task. On Charades-STA, the full model outperforms "w/o SE" by 1.44% on the metric "R@1, IoU=0.7" and exceeds "w/o LD" on all metrics. Besides, the full model also outperforms "w/o SE-LD" by a large margin on all metrics. Similarly, the full model achieves significant improvements on all metrics compared with the variant "EMTM w/o SE-LD" on ActivityNet.

Additionally, we conduct two ablation experiments on the KAU, reported in Table 6. As shown, the KAU with kernel sizes 3, 5, and 7 outperforms the variant that uses kernel sizes 3, 3, and 3.

4.5.2 Effect of Number of Teacher Models

We investigate the influence of different numbers of teacher models on Charades-STA. As shown in Figure 4, performance tends to rise as the number of teachers increases. These results indicate that our improvements come not only from the soft targets of a single teacher, but also from the structural and intermediate-level knowledge learned through fused multi-teacher teaching. Multiple teachers make knowledge distillation more flexible, and the ensemble helps improve the training of the student and transfers related information about the examples to it.

[Figure 4]

[Figure 5]

[Figure 6]

4.5.3 Effect of Different Degree of Lightweight Models

We evaluate the influence of different degrees of lightweighting by adjusting the hidden dimension $d$ on Charades-STA. As shown in Figure 5, as $d$ decreases, the FLOPs and parameter size decline, but performance also drops. Reducing $d$ from 128 to 64, both R1@0.7 and mIoU fall by about 5%, while FLOPs and parameter size drop only by a small margin. As a trade-off, we select 128 as the hidden dimension.

4.6 Qualitative Analysis

Two prediction samples on Charades-STA are depicted in Figure 6. The first sample indicates that our approach can refine predictions even when the basic model has already obtained satisfactory results. The second sample shows that the basic model tends to predict positions near the video boundary, possibly due to its limited understanding of the video; as a result, it relies on biased positional information to make moment predictions. Utilizing the shared encoder and label distillation provides additional information that enables the model to predict the moment boundary more precisely.

5 Conclusion

In this paper, we focus on the efficiency of TSGV models and extend the accelerated span to cover the entire model. We propose a knowledge distillation framework (EMTM) that utilizes label distillation from multiple teachers together with a shared encoder strategy. In the future, we will pay attention to video feature extraction in TSGV, which is also a time-consuming process, and develop an end-to-end model that takes raw video frames as input.

References

  • [1] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015.
  • [2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [3] Jianxiang Dong and Zhaozheng Yin. Boundary-aware temporal sentence grounding with adaptive proposal refinement. In Proceedings of the Asian Conference on Computer Vision, pages 3943–3959, 2022.
  • [4] Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Interspeech, pages 3697–3701, 2017.
  • [5] Junyu Gao and Changsheng Xu. Fast video moment retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1523–1532, 2021.
  • [6] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.
  • [7] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [8] Jiabo Huang, Hailin Jin, Shaogang Gong, and Yang Liu. Video activity localisation with uncertainties in temporal boundary. In Computer Vision–ECCV 2022, Proceedings, Part XXXIV, pages 724–740. Springer, 2022.
  • [9] Zheng Li, Jingwen Ye, Mingli Song, Ying Huang, and Zhigeng Pan. Online knowledge distillation for efficient pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11740–11750, 2021.
  • [10] Jihao Liu, Boxiao Liu, Hongsheng Li, and Yu Liu. Meta knowledge distillation. arXiv preprint arXiv:2202.07940, 2022.
  • [11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
  • [12] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36, 2013.
  • [13] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • [14] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, pages 510–526, 2016.
  • [15] Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [16] Ziyue Wu, Junyu Gao, Shucheng Huang, and Changsheng Xu. Learning commonsense-aware moment-text alignment for fast video temporal grounding. arXiv preprint arXiv:2204.01450, 2022.
  • [17] Shuo Yang and Xinxiao Wu. Entity-aware and motion-aware transformers for language-driven action localization. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 1552–1558, 2022.
  • [18] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1285–1294, 2017.
  • [19] Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. Fast and accurate reading comprehension by combining self-attention and convolution. In International Conference on Learning Representations, 2018.
  • [20] Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. Advances in Neural Information Processing Systems, 32, 2019.
  • [21] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, Proceedings, Part I, pages 818–833. Springer, 2014.
  • [22] Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931, 2020.
  • [23] Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. Parallel attention network with sequence matching for video grounding. arXiv preprint arXiv:2105.08481, 2021.
  • [24] Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Temporal sentence grounding in videos: A survey and future directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, (01):1–20, 2023.
  • [25] Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2D temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12870–12877, 2020.
  • [26] Minghang Zheng, Yanjie Huang, Qingchao Chen, Yuxin Peng, and Yang Liu. Weakly supervised temporal sentence grounding with Gaussian-based contrastive proposal learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15555–15564, 2022.