From 55ed145dfedeb819c183da58167968fba7b083c3 Mon Sep 17 00:00:00 2001
From: write_math <2799449627xu@gmail.com>
Date: Wed, 10 Apr 2024 03:36:30 +0800
Subject: [PATCH] Update
---
README.md | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/README.md b/README.md
index a859253..540d7d6 100644
--- a/README.md
+++ b/README.md
@@ -28,9 +28,9 @@ This is a curated list of audio-visual learning methods and datasets, based on o
- [Spatial Sound Generation](#spatial-sound-generation)
- [Video Generation](#video-generation)
  - [Talking Face](#talking-face)
- - [Gesture](#gesture)
+ - [Gesture](#gesture)
- [Dance](#dance)
- - [Image Manipulation](#image-manipulation)
+ - [Image Manipulation](#image-manipulation)
- [Depth Estimation](#depth-estimation)
- [Audio-visual Transfer Learning](#audio-visual-transfer-learning)
- [Cross-modal Retrieval](#cross-modal-retrieval)
@@ -776,7 +776,7 @@ Intelligent Networks and Network Security
**Authors:** Yitao Cai, Huiyu Cai, Xiaojun Wan
**Institution:** Peking University
-
+
**[ACL-2020]**
[Sentiment and Emotion help Sarcasm? A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis](https://aclanthology.org/2020.acl-main.401/)
@@ -1693,7 +1693,7 @@ Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan,
**Authors:** Ruohan Gao, Kristen Grauman
**Institution:** The University of Texas at Austin; Facebook AI Research
-
+
**[ICIP-2019]**
[Self-Supervised Audio Spatialization with Correspondence Classifier](https://ieeexplore.ieee.org/abstract/document/8803494/)
@@ -1948,7 +1948,7 @@ Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan,
**Authors:** Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, Hedvig Kjellström
**Institution:** KTH Royal Institute of Technology in Stockholm; Hokkai Gakuen University; Aoyama Gakuin University
-
+
**[CVPR-2019]**
[Learning Individual Styles of Conversational Gesture](https://openaccess.thecvf.com/content_CVPR_2019/html/Ginosar_Learning_Individual_Styles_of_Conversational_Gesture_CVPR_2019_paper.html)
@@ -2187,7 +2187,7 @@ Tsinghua University; University of Michigan; Shanghai Qi Zhi Institute
**Authors:** Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao
**Institution:** Shanghai Qi Zhi Institute; UT Austin; South China University of Technology; Massachusetts Institute of Technology; Tsinghua University
-
+
**[CVPR-2021]**
[Distilling Audio-visual Knowledge by Compositional Contrastive Learning](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Distilling_Audio-Visual_Knowledge_by_Compositional_Contrastive_Learning_CVPR_2021_paper.pdf)
@@ -2305,7 +2305,7 @@ Tsinghua University; University of Michigan; Shanghai Qi Zhi Institute
**Authors:** Donghuo Zeng, Yi Yu, Keizo Oyama
**Institution:** National Institute of Informatics
-
+
**[IEEE TGRS-2020]**
[Deep Cross-Modal Image–Voice Retrieval in Remote Sensing](https://ieeexplore.ieee.org/abstract/document/9044618)
@@ -2911,7 +2911,7 @@ Tsinghua University; University of Michigan; Shanghai Qi Zhi Institute
Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping Zhang, Xiaokang Yang, Xinping Guan
**Institution:** Shanghai Jiao Tong University; University of Macau; Ryerson University
-
+
**[IROS-2021]**
[ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction](https://ieeexplore.ieee.org/abstract/document/9635989)
@@ -3538,7 +3538,7 @@ Visual Scene-Aware Dialog](https://ieeexplore.ieee.org/document/10147255)
## Datasets
-| Dataset | Year | Videos | Length | Data form | Video source | Task
+| Dataset | Year | Videos | Length | Data form | Video source | Task |
| :-------------------: | :-------: | :-------: | :-------: | :---------------------------: | :-----------------------: | :-------------------------------------------------------------: |
| [LRW, LRS2 and LRS3](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/) | 2016, 2018, 2018 | - | 800h+ | video | in the wild | Speech-related, speaker-related, face generation-related tasks |
| [VoxCeleb, VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html) | 2017, 2018 | - | 2,000h+ | video | YouTube | Speech-related, speaker-related, face generation-related tasks |
@@ -3564,3 +3564,4 @@ Visual Scene-Aware Dialog](https://ieeexplore.ieee.org/document/10147255)
| [Pano-AVQA](https://paperswithcode.com/dataset/visual-question-answering) | 2021 | 5.4k | 7.7h | 360 video with QA | Video-sharing platforms | Audio-visual question answering |
| [MUSIC-AVQA](https://gewu-lab.github.io/MUSIC-AVQA/) | 2022 | 9,288 | 150h+ | video with QA | YouTube | Audio-visual question answering |
| [AVSBench](https://arxiv.org/abs/2207.05042) | 2022 | 5,356 | 14.8h+ | video | YouTube | Audio-visual segmentation, sound localization |
+| [RAF](https://arxiv.org/abs/2403.18821) | 2024 | - | 95h+ | 3D environment | Recorded videos | Spatial Sound Generation |