pip install moviepy
pip install imageio_ffmpeg
pip install webvtt-py
git clone https://github.com/hyunbinui/VMT-all-at-once.git
create the 'original_video' and 'original_subs' directories inside the 'data' directory for YouTube videos and subtitles
mkdir original_video original_subs
(recommended) get video ids from a playlist
# playlist → youtube ids → txt file
youtube-dl --get-id [playlist link] -i >> list.txt
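download videos / subtitles (en-ko) from YouTube using youtube-dl
# only one -o output template takes effect per run, so videos and subtitles are downloaded in two passes
youtube-dl -a list.txt -o '/target_directory/original_video/%(id)s.%(ext)s' --rm-cache-dir
youtube-dl -a list.txt --write-srt --sub-lang en,ko --skip-download -o '/target_directory/original_subs/%(id)s.%(ext)s'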
if youtube-dl is way too slow, try yt-dlp for downloading the videos and keep youtube-dl for the subtitles only
yt-dlp -a list.txt -o '/target_directory/original_video/%(id)s.%(ext)s' -S ext:mp4:m4a -i
youtube-dl -a list.txt --write-srt --sub-lang en,ko -o '/target_directory/original_subs/%(id)s.%(ext)s' --skip-download -i
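not every video comes with both en and ko subtitles, so it may be worth checking the downloads before building the dataset. below is a minimal sketch, not part of this repo: the helper name `find_missing_subs` is hypothetical, and it assumes subtitles are saved as `<id>.<lang>.vtt` like in the directory tree further down.

```python
from pathlib import Path


def find_missing_subs(id_file, subs_dir, langs=("en", "ko")):
    """Return {video_id: [missing language codes]} for incomplete subtitle pairs.

    Hypothetical helper; assumes files are named '<id>.<lang>.vtt'.
    """
    subs_dir = Path(subs_dir)
    lines = Path(id_file).read_text().splitlines()
    # `>>` appends to list.txt, so the same id may appear twice; keep the first one.
    ids = list(dict.fromkeys(line.strip() for line in lines if line.strip()))

    missing = {}
    for video_id in ids:
        absent = [
            lang for lang in langs
            if not (subs_dir / f"{video_id}.{lang}.vtt").exists()
        ]
        if absent:
            missing[video_id] = absent
    return missing
```

calling `find_missing_subs("list.txt", "original_subs")` from the 'data' directory should return an empty dict when every id has both subtitle files.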
- construct the text pair and video dataset by running the create_dataset.py file in 'data' directory
python create_dataset.py --idpath ./list.txt
- cf. text_data.json annotation format
{ 'YouTubeID_StartTime_EndTime': { 'ko' : 'Parallel Korean Caption', 'en' : 'Parallel English Caption'}, ... }
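to make the annotation format concrete, here is a small sketch of how text_data.json could be read back. the function names are hypothetical (not from create_dataset.py), and it assumes that the file is valid JSON and that StartTime / EndTime contain no underscores themselves.

```python
import json


def split_clip_key(key):
    """Split 'YouTubeID_StartTime_EndTime' into (video_id, start, end) strings."""
    # YouTube ids can contain underscores, so split from the right.
    video_id, start, end = key.rsplit("_", 2)
    return video_id, start, end


def load_pairs(path):
    """Yield (video_id, start, end, en, ko) for entries that have both captions."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for key, captions in data.items():
        if "en" in captions and "ko" in captions:
            yield (*split_clip_key(key), captions["en"], captions["ko"])
```

splitting from the right keeps ids such as `a_b-C123` intact; the time format itself is whatever create_dataset.py writes and is returned as a plain string.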
: most VMT models have no internal video feature extractor, so we need to extract the video features ourselves and feed them to the model as input. in short, we need our own VIDEO FEATURE EXTRACTOR !
pip install imageio
pip install --upgrade mxnet
pip install --upgrade gluoncv
- you've already done it, right ?
- extract video features using the Inception-v1 I3D model pretrained on the Kinetics 400 dataset and save them as .npy files. each video is represented as a numpy array of shape (1, num_of_segments, 1024).
python action_feature_extractor.py
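a quick sanity check of one extracted file can catch a broken clip early. this is only a sketch: `check_feature` is a hypothetical helper, and the expected shape (1, num_of_segments, 1024) is taken from the description above.

```python
import numpy as np


def check_feature(path, feature_dim=1024):
    """Load one .npy feature file and return its number of segments.

    Raises ValueError if the array is not shaped (1, num_of_segments, feature_dim).
    """
    features = np.load(path)
    if (
        features.ndim != 3
        or features.shape[0] != 1
        or features.shape[2] != feature_dim
    ):
        raise ValueError(f"{path}: unexpected shape {features.shape}")
    return features.shape[1]
```

for example, `check_feature("dataset/action_features/YouTubeID_StartTime_EndTime.npy")` would return num_of_segments for that clip.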
- some VMT models (e.g., DEAR) take video action labels as an input. we can also create these action labels with the pretrained I3D model.
python action_label_extractor.py
- cf. action_labels.json annotation format
{ 'YouTubeID_StartTime_EndTime': [19, 17, 191, 171, 97], ... }
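the label file can be validated in the same spirit. the sketch below assumes the integers are Kinetics 400 class indices in the range 0..399 (an assumption based on the pretrained model, not something action_label_extractor.py is confirmed to guarantee); `invalid_label_entries` is a hypothetical name.

```python
import json

# Assumption: labels index the 400 Kinetics classes, i.e. 0..399.
NUM_KINETICS400_CLASSES = 400


def invalid_label_entries(path, num_classes=NUM_KINETICS400_CLASSES):
    """Return the clip ids whose value is not a list of in-range integer labels."""
    with open(path, encoding="utf-8") as f:
        labels = json.load(f)
    bad = []
    for clip_id, ids in labels.items():
        ok = isinstance(ids, list) and all(
            isinstance(i, int) and 0 <= i < num_classes for i in ids
        )
        if not ok:
            bad.append(clip_id)
    return bad
```

an empty list back means every clip has a well-formed label list under that assumption.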
├── original_video
│ ├── YouTubeID1.mp4
│ ├── YouTubeID2.mp4
│ └── .....
├── original_subs
│ ├── YouTubeID1.ko.vtt
│ ├── YouTubeID1.en.vtt
│ ├── YouTubeID2.ko.vtt
│ ├── YouTubeID2.en.vtt
│ └── .....
├── dataset
│ ├── video_data
│ │ ├── YouTubeID_StartTime_EndTime.mp4
│ │ ├── YouTubeID_StartTime_EndTime.mp4
│ │ └── .....
│ │
│ ├── action_features
│ │ ├── YouTubeID_StartTime_EndTime.npy
│ │ ├── YouTubeID_StartTime_EndTime.npy
│ │ └── .....
│ │
│ ├── action_labels.json
│ └── text_data.json
├── list.txt
├── utils.py
└── create_dataset.py
DEAR (Dual-lEvel bAck-tRanslation) is a model that tackles the video-guided machine translation (VMT) task via dual-level back-translation. Specifically, it introduces sentence-level back-translation alongside concept-level back-translation and uses multi-pattern joint learning to improve translation performance.
If you want to know more about DEAR, check out the official website and this post.
+ There are two DEAR folders in this repo. To clarify, the 'DEAR' folder contains the en ↔ zh translation model, while 'DEAR_ko' contains the ko ↔ en translation model.
DEAR takes three elements as input: parallel sentence pairs, video action features, and video action labels.
You've already created them, right ?
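if you are not sure, a small cross-check can list the clips that lack one of the three inputs. this is a sketch under assumptions: `clips_missing_inputs` is a hypothetical helper, and it assumes the layout from the directory tree above (text_data.json, action_labels.json, and one .npy per clip named after the clip id).

```python
import json
from pathlib import Path


def clips_missing_inputs(text_json, labels_json, features_dir):
    """Return sorted clip ids that are missing text, labels, or features."""
    with open(text_json, encoding="utf-8") as f:
        text_ids = set(json.load(f))
    with open(labels_json, encoding="utf-8") as f:
        label_ids = set(json.load(f))
    # 'YouTubeID_StartTime_EndTime.npy' -> 'YouTubeID_StartTime_EndTime'
    feature_ids = {p.stem for p in Path(features_dir).glob("*.npy")}

    all_ids = text_ids | label_ids | feature_ids
    complete = text_ids & label_ids & feature_ids
    return sorted(all_ids - complete)
```

e.g. `clips_missing_inputs("dataset/text_data.json", "dataset/action_labels.json", "dataset/action_features")` run from the 'data' directory; an empty list means all clips are complete.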
- basics
: python 3.6+ recommended. I used python 3.9.12
: pytorch 1.0.0+ recommended. I used torch==1.7.1+cu110
pip install torch
- install konlpy + MeCab (for DEAR_ko)
: this may be troublesome, but our friend google is always there for you. good luck !
# konlpy
sudo apt-get install g++ openjdk-8-jdk python3-dev python3-pip curl  # install Java 1.8 or up
python3 -m pip install --upgrade pip
python3 -m pip install konlpy
# MeCab
sudo apt-get install curl git
bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
# if apt-get update fails due to NO_PUBKEY error, run the following code and try again
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
python train_circle.py
- Chen, S., Zeng, Y., Cao, D., & Lu, S. (2022). Video-guided machine translation via dual-level back-translation. Knowledge-Based Systems, 245, 108598.
- https://kbs-2021.wixsite.com/dear
VRET (Visual Relationship-Enhanced Transformer) is a model that tackles the video-guided machine translation (VMT) task by constructing a semantic–visual relational graph as a cross-modal bridge. Specifically, a graph convolutional network is deployed to capture the relationships among visual semantics and improve translation performance.
If you want to know more about VRET, check out the official website and this post.
+ There are two VRET folders in this repo. To clarify, the 'VRET' folder contains the tr → en translation model, while 'VRET_ko' contains the ko → en translation model.
VRET takes three elements as input: parallel sentence pairs, scene nodes, and scene graphs.
You've already created parallel sentence pairs and corresponding video clips, right ? Along with parallel sentence pairs, you additionally need to extract scene nodes and scene graphs from videos. Follow this repo to extract scene nodes and scene graphs.
: python 3.6+ recommended. I used python 3.9.12
: pytorch 1.0.0+ recommended. I used torch==1.7.1+cu110
- install konlpy + MeCab (for VRET_ko)
: this may be troublesome, but our friend google is always there for you. good luck !
# konlpy
sudo apt-get install g++ openjdk-8-jdk python3-dev python3-pip curl  # install Java 1.8 or up
python3 -m pip install --upgrade pip
python3 -m pip install konlpy
# MeCab
sudo apt-get install curl git
bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
# if apt-get update fails due to NO_PUBKEY error, run the following code and try again
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
- install TrTokenizer (for VRET)
pip install trtokenizer
python train.py