CMT reduces the training cost of diffusion-based flow map models by up to 90% while reaching SOTA performance
ICLR26
A framework for identifying which training examples influenced specific concepts within a diffusion model
ICLR26
Improved object-centric diffusion learning with registers and contrastive alignment
ICLR26
An improved mechanism for applying classifier-free guidance in discrete diffusion
ICLR26
Learning conditional, unconditional, and matching-aware discriminators with an adaptive weighting mechanism (cSAN)
ICLR26
A tensor-decomposition-based PEFT method, shown to be effective on text-to-image generation tasks
ICCV25
Theoretical analysis of the limitations of current discrete diffusion and a method for effectively capturing element-wise dependencies
ICML25
Improving Consistency Training with a learned data-noise coupling
ICML25
Classifier-Free Guidance inside the Attraction Basin May Cause Memorization
CVPR25
A general method to find an optimal sampling schedule for inference in discrete diffusion
ICLR25
A method that efficiently leverages online human feedback to fine-tune Stable Diffusion for a wide range of tasks
ICLR25
An enhanced multimodal representation using weighted point clouds and its theoretical benefits
ICLR25
A 64x64 pre-trained diffusion model is all you need for 1-step high-resolution SOTA generation
NeurIPS24
A unified framework that enables diverse samplers and SOTA 1-step generation
ICLR24
Applications:
[SoundGen]
Enhancing GANs with metrizable discriminators
ICLR24
Applications:
[Vocoder]
A fast, efficient, training-free, and controllable diffusion-based generation method
ICLR24
Generalizing hierarchical VQ-VAEs with a Bayesian framework
TMLR
Improving density estimation of diffusion models
ICML23
Improving codebook utilization and training stability
ICML22
Mitigating oversmoothing in VAEs
Neurocomputing
DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
EMNLP25
CARE: Assessing the Impact of Multilingual Human Preference Learning on Cultural Awareness
EMNLP25
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
ICCV25 MRR Workshop
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
TMLR
Cross-Modal Learning for Music-to-Music-Video Description Generation
NAACL25 RepL4NLP Workshop
VinaBench: Benchmark for Faithful and Consistent Visual Narratives
CVPR25
OpenMU: Your Swiss Army Knife for Music Understanding
ISMIR2024 Late Breaking Demos
DiffuCOMET: Contextual Commonsense Knowledge Diffusion
ACL24
On the Language Encoder of Contrastive Cross-modal Models
ACL24
Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning
ACL24
PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives (Outstanding Paper Award)
ACL23
ComFact: A Benchmark for Linking Contextual Commonsense Knowledge
EMNLP22 Findings
Tool Calling For Music Post-Production
ICLR26
Automatic music mixing using a generative model of effect embeddings
ICASSP26
Automatic Music Sample Identification with Multi-Track Contrastive Learning
ICASSP26
Leveraging Whisper Embeddings for Audio-based Lyrics Matching
ICASSP26
Training Data Attribution
Large-Scale Training Data Attribution for Music Generative Models via Unlearning
NeurIPS25 Creative AI
Reductive, exclusionary, normalising: The limits of generative AI music
TISMIR
Can Large Language Models Predict Audio Effects Parameters from Natural Language?
WASPAA25
Vocal Effects Style Transfer
Inference-Time Optimisation for Vocal Effects Style Transfer using DiffVox
WASPAA25
SOTA Fx representation: Extracting instrument-wise audio effects representations from music mixtures
ISMIR25
Inference Time Optimization for Music Mastering Style Transfer
ISMIR25
Reverse Engineering of Music Mixing Graphs with Differentiable Processors and Iterative Pruning
JAES
Supervised contrastive learning from weakly-labeled audio segments for musical version matching
ICML25
Music Foundation Model as Generic Booster for Music Downstream Tasks
TMLR
DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions
DAFx25
VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression
ICASSP25
Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer
ICASSP25
Searching For Music Mixing Graphs: A Pruning Approach
DAFx24
Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data
DAFx24
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
IJCAI24
Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription
ICASSP24
VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance
ICASSP24
Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
ISMIR23
An Attention-based Approach To Hierarchical Multi-label Music Instrument Classification
ICASSP23
Unsupervised Vocal Dereverberation with Diffusion-based Generative Models
ICASSP23
Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects
ICASSP23
DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability
ICASSP23
Hierarchical Diffusion Models for Singing Voice Neural Vocoder
ICASSP23
Distortion Effect Removal
Distortion Audio Effects: Learning How to Recover the Clean Signal
ISMIR22
Automatic Music Mixing with Deep Learning and Out-of-Domain Data
ISMIR22
Music Source Separation with Deep Equilibrium Models
ICASSP22
Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks
ICASSP22
Robust One-Shot Singing Voice Conversion
Glenn Gould and Kanji Ishimaru 2021: A collaboration with AI Sound Separation after 60 years
VIRTUE: Visual-Interactive Text-Image Universal Embedder
ICLR26
CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
ACMMM25
TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models
ICCV25
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
CVPR25
MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
ICLR25
SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
ICLR25
Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
ICLR25
GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping
NeurIPS24
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond
ISMIR24
Hearing Anything Anywhere
CVPR24
BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
ICASSP24
Zero- and Few-shot Sound Event Localization and Detection
ICASSP24
STARSS23
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
NeurIPS23
Audio Restoration: ViT-AE
Extending Audio Masked Autoencoders Toward Audio Restoration
WASPAA23
Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement
INTERSPEECH23
CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
ICLR23
Sound Event Localization and Detection
Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training
ICASSP22