https://hyper.ai/datasets/8754
AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering backgruond noises. The segments are 3-10 seconds long, and in each clip the audible sound in the soundtrack belongs to a single speaking person, visible in the video. In total, the dataset contains roughly 4700 hours of video segments, from a total of 290k YouTube videos, spanning a wide variety of people, languages and face poses. For more details on how we created the dataset see our paper.Previous


领域场景: | 未指定 |
领域问题: | 未指定 |
领域应用: | 未指定 |
应用案例: | 未指定 |