Scripts for filtering AVSpeech data with a vision-and-language transformer. Videos are scored for audio-visual correspondence, and only samples with high correspondence are kept. The filtered data was used for Self-Supervised Visual-Acoustic Matching.
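The core of the filtering step is thresholding a per-clip audio-visual correspondence score produced by the transformer. A minimal sketch of that step, assuming the scores have already been exported to a CSV with (hypothetical) columns `clip_id` and `av_score`, and a (hypothetical) cutoff value:

```python
import csv

# Hypothetical threshold; the actual cutoff depends on the score
# distribution produced by the vision-and-language transformer.
SCORE_THRESHOLD = 0.5

def filter_clips(scores_csv: str, output_txt: str, threshold: float = SCORE_THRESHOLD) -> None:
    """Keep only clips whose audio-visual correspondence score meets the threshold."""
    kept = []
    with open(scores_csv, newline="") as f:
        for row in csv.DictReader(f):
            if float(row["av_score"]) >= threshold:
                kept.append(row["clip_id"])
    # Write the surviving clip IDs, one per line, for downstream training.
    with open(output_txt, "w") as f:
        f.write("\n".join(kept))

if __name__ == "__main__":
    filter_clips("av_scores.csv", "filtered_clips.txt")
```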
Note: GitHub often has issues rendering Python notebooks, so the analysis notebook can also be viewed here.