katha-ai/RecapStorySumm-CVPR2024


📑 Contents

  1. About
  2. Setting up the repository
    1. Create a virtual environment
    2. Update the config template
  3. Feature Extraction
  4. Downloading and Setting up the data directories
    1. PlotSnap features
    2. TaleSumm pre-trained weights
  5. Train TaleSumm with different configurations
  6. Inference on TaleSumm to create summaries
  7. License
  8. Bibtex

🤖About

This is the official code repository for the CVPR 2024 paper "Previously on ..." From Recaps to Story Summarization. This repository contains the implementation of TaleSumm, a Transformer-based hierarchical model, applied to our proposed dataset PlotSnap. TaleSumm processes entire episodes by creating compact shot 🎞️ and dialog 🗣️ representations, and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Our model leverages multiple modalities, visual and dialog, to capture a comprehensive understanding of important shots in complex movie environments. We also provide pre-trained weights for TaleSumm and for all feature backbones used in feature extraction, as well as pre-extracted episode features (per-frame embeddings using DenseNet169, CLIP, and MViT) and dialog features (from a finetuned RoBERTa backbone).

⚙️Setting up the repository

🐍Create a Python-virtual environment

  1. Clone the repository and change the working directory to the project's root.

    $ git clone https://github.com/katha-ai/RecapStorySumm-CVPR2024
    $ cd RecapStorySumm-CVPR2024
  2. This project strictly requires python==3.8.

    Create a virtual environment using Conda.

    $ conda create -n storysumm python=3.8
    $ conda activate storysumm
    (storysumm) $ pip install -r requirements.txt

    OR

    Create a virtual environment using pip (make sure you have Python 3.8 installed)

    $ python3.8 -m pip install virtualenv
    $ python3.8 -m virtualenv storysumm
    $ source storysumm/bin/activate
    (storysumm) $ pip install -r requirements.txt
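
    As an optional sanity check (assuming requirements.txt installs PyTorch, which is an assumption here), you can confirm the interpreter version and GPU visibility:

    (storysumm) $ python -c "import sys, torch; print(sys.version); print(torch.__version__, torch.cuda.is_available())"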

🛠️Configure the configs/base.yaml file

  1. Add the absolute paths to the project directory in configs/base.yaml

  2. For example, if you have cloned the repository at /home/user/RecapStorySumm-CVPR2024 and want to download the model checkpoints and data features there, the path variables in configs/base.yaml would be:

    root: "/home/user/RecapStorySumm-CVPR2024"
    # Save PlotSnap data features here
    data_path: "${root}/data"
    split_dir: "${root}/configs/data_configs/splits"
    # To save dialog (and vision) backbones
    cache_dir: "${root}/cache/"
    # use the following for model checkpoints 
    ckpt_path: "${root}/checkpoints/storysumm"

Refer to configs/trainer_config.yaml and configs/inference_config.yaml for the default parameter configuration used during training and inference, respectively.
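
The ${root} interpolation in base.yaml follows OmegaConf-style syntax. As a minimal sketch (assuming OmegaConf is available in the environment, e.g. via requirements.txt), you can check that the paths resolve as intended:

    (storysumm) $ python -c "from omegaconf import OmegaConf; cfg = OmegaConf.load('configs/base.yaml'); print(OmegaConf.to_yaml(cfg, resolve=True))"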

🔍Feature Extraction

Follow the instructions in feature_extractors/README.md [WIP] to extract the required features from any given video and prepare it for summarization.

Note that we have already provided the pre-extracted features for PlotSnap below.
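
Until that README is complete, below is a minimal, illustrative sketch of extracting per-frame CLIP embeddings (one of the visual backbones listed above) with Hugging Face transformers. The checkpoint name openai/clip-vit-base-patch32 and the frame-loading details are assumptions for illustration, not necessarily the exact pipeline used in this repository:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Assumption: frames have already been extracted from the video as image files.
    frame_paths = ["frame_0001.jpg", "frame_0002.jpg"]
    frames = [Image.open(p).convert("RGB") for p in frame_paths]

    # Assumption: backbone variant; the exact CLIP checkpoint used for PlotSnap may differ.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    with torch.no_grad():
        inputs = processor(images=frames, return_tensors="pt")
        embeddings = model.get_image_features(**inputs)  # shape: (num_frames, 512)

    print(embeddings.shape)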

📥Download

🗃️PlotSnap features

You can also use wget to download these files:

# Download the features (as mentioned below into data/ folder)
LINK="https://iiitaphyd-my.sharepoint.com/:u:/g/personal/makarand_tapaswi_iiit_ac_in/EdEsWTvAEg5Iuo1cAUNmVq4Bipauv5nGdTdXAtMidWR5GA?e=dLWkNo"
wget -O data $LINK
File name: 24
Contents:
  • Contains a total of 8 seasons (S02 to S09).
  • Each season consists of 24 episodes, except S09, which has 12 episodes.
  • Each episode consists of:
    1. encodings/: video and dialog encodings
    2. scores/: different forms of labels for both video and dialog
    3. videvents/: files containing the start and end times of the constituent shots of the recap and episode
    4. SXXEXX.dfd: per-frame scores denoting the likelihood of a shot boundary
    5. SXXEXX.matidx: per-frame info on shot index, frame index, and time (seconds:microseconds)
    6. SXXEXX.srt: dialog file (for visualization)
    7. shot_frames/: 3 frames from each shot
Comments: Contains the S02 to S09 directories, which occupy 92 GB of disk space.

File name: Prison Break
Contents:
  • Contains a total of 2 seasons (S02 & S03), with 22 and 13 episodes, respectively.
  • The episodes follow the same directory structure as the TV show 24.
Comments: Occupies 22 GB of disk space.
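
As a rough sketch of how to enumerate the downloaded features (assuming they land under data/ with the per-episode layout described above; the exact season and episode directory names are assumptions):

    from pathlib import Path

    data_root = Path("data/24")  # or the Prison Break folder; exact folder names may differ

    # Walk the season directories (S02 ... S09) and list what each episode directory contains.
    for season_dir in sorted(data_root.glob("S*")):
        for episode_dir in sorted(p for p in season_dir.iterdir() if p.is_dir()):
            contents = sorted(p.name for p in episode_dir.iterdir())
            print(episode_dir.name, "->", contents)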

🦾TaleSumm pre-trained weights

# Create the checkpoints folder `checkpoints/storysumm` in the project's root folder (if not present already) and put all the checkpoints in it.
mkdir -p <absolute_path_to_root>/checkpoints/storysumm

# OR (simply do the following).

# Now download the pre-trained weights (as mentioned below, into the checkpoints/ folder)
LINK="https://iiitaphyd-my.sharepoint.com/:u:/g/personal/makarand_tapaswi_iiit_ac_in/ES91ZF90ArJGiXkEa53-kJABNytKOyOSQlr03dnTf6bKKg?e=PN1Gir"
wget -O checkpoints $LINK
File name: TaleSumm-IntraCVT|S[1,2,3,4,5]
Comments: IntraCVT split i=0,1,2,3,4 checkpoint of TaleSumm
Training command: (storysumm) $ python -m trainer split_type='intra-loocv'

File name: TaleSumm-Final
Comments: Final checkpoint of TaleSumm to be used in production
Training command: (storysumm) $ python -m trainer split_type='final-split.yaml'
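
To sanity-check a downloaded checkpoint before training or inference, here is a minimal sketch (the file name below is a placeholder, and whether a checkpoint is a raw state_dict or a dict with extra metadata is an assumption):

    import torch

    # Placeholder path; point this at one of the downloaded checkpoint files.
    ckpt = torch.load("checkpoints/storysumm/<checkpoint_file>", map_location="cpu")

    # Inspect the top-level structure without building the model itself.
    if isinstance(ckpt, dict):
        print(list(ckpt.keys())[:10])
    else:
        print(type(ckpt))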

🏋️‍♂️Train

After completing the above, you can train TaleSumm on a 12 GB NVIDIA RTX 2080 Ti GPU! You can also use the pre-trained weights provided in the Download section.

Note: It is recommended to use wandb to log & track your experiments

The commands below use the default values given in config_base.yaml; a combined example is shown after the list.

  1. To train TaleSumm for PlotSnap, use the default config (no argument required)

    (storysumm) $ python -m trainer
  2. To train TaleSumm with a specific modality (valid keywords- vid, dia, both)

    (storysumm) $ python -m trainer modality=both
  3. To train TaleSumm on a specific series (valid keywords- 24, prison-break, all)

    (storysumm) $ python -m trainer series='24'
  4. To change the split type to be used for training (valid keywords- cross-series, intra-loocv, inter-loocv, default-split.yaml, fandom-split.yaml)

    (storysumm) $ python -m trainer split_type=cross-series
  5. To choose which visual features to train on, create a list of the features to be used (valid keywords- imagenet, mvit, clip)

    (storysumm) $ python -m trainer visual_features=['imagenet','mvit','clip']
  6. To choose the fusion style of the visual features (valid keywords- concat, stack, mul)

    (storysumm) $ python -m trainer feat_fusion_style=concat
  7. To choose the type of attention in the model (valid keywords- sparse, full)

    (storysumm) $ python -m trainer attention_type=sparse
  8. To disable Group tokens from the model

    (storysumm) $ python -m trainer withGROUP=False

    NOTE : If withGROUP is True then computeGROUPloss needs to be True as well

  9. To enable wandb logging (recommended)

    (storysumm) $ python -m trainer wandb.logging=True

NOTE : We used 4 GPUs for training, which is why the gpus parameter in the configuration is set to [0,1,2,3]. If you plan to use more or fewer GPUs, please enter their GPU IDs accordingly.
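
The overrides above can be combined in a single run. For example (an illustrative combination of the documented options, not a prescribed recipe):

    (storysumm) $ python -m trainer series='24' modality=both visual_features=['imagenet','mvit','clip'] attention_type=sparse wandb.logging=True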

Inference

To summarize a new video using TaleSumm, run the following command:

(storysumm) $ python -m inference <overrides for inference_config.yaml>

NOTE : The gpus parameter in the inference configuration also defaults to [0,1,2,3] (we used 4 GPUs); if you plan to use more or fewer GPUs, please enter their GPU IDs accordingly.
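
For example, to run inference on a single GPU (gpus is the parameter mentioned in the note above; whether inference accepts it in exactly this form, and which other overrides exist, is determined by inference_config.yaml):

    (storysumm) $ python -m inference gpus=[0]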

📜License

This code is available for non-commercial scientific research purposes as defined in the LICENSE file. By downloading and using this code you agree to the terms in the LICENSE. Third-party datasets and software are subject to their respective licenses.

📍Cite

If you find any part of this repository useful, please cite the following paper!

@inproceedings{singh2024previously,
  title     = {{"Previously on ..." From Recaps to Story Summarization}},
  author    = {Aditya Kumar Singh and Dhruv Srivastava and Makarand Tapaswi},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}