CMT reduces the training cost of diffusion-based flow map models by up to 90% while reaching SOTA performance
ICLR26
A framework for identifying which training examples influenced specific concepts within a diffusion model
ICLR26
Improved object-centric diffusion learning with registers and contrastive alignment
ICLR26
An improved mechanism for applying classifier-free guidance in discrete diffusion
ICLR26
Learning conditional, unconditional, and matching-aware discriminators with an adaptive weighting mechanism (cSAN)
ICLR26
A tensor-decomposition-based PEFT method, shown to be effective on text-to-image generation tasks
ICCV25
Theoretical analysis of the limitations of current discrete diffusion and a method for effectively capturing element-wise dependencies
ICML25
Improving Consistency Training with a learned data-noise coupling
ICML25
Classifier-Free Guidance inside the Attraction Basin May Cause Memorization
CVPR25
A general method to find an optimal sampling schedule for inference in discrete diffusion
ICLR25
A method that efficiently leverages online human feedback to fine-tune Stable Diffusion for a wide range of tasks
ICLR25
An enhanced multimodal representation using weighted point clouds and its theoretical benefits
ICLR25
A 64x64 pre-trained diffusion model is all you need for 1-step high-resolution SOTA generation
NeurIPS24
A unified framework that enables diverse samplers and SOTA 1-step generation
ICLR24
Applications:
[SoundGen]
Enhancing GANs with metrizable discriminators
ICLR24
Applications:
[Vocoder]
A fast, efficient, training-free, and controllable diffusion-based generation method
ICLR24
Generalizing hierarchical VQ-VAEs with a Bayesian framework
TMLR
Improving density estimation of diffusion models
ICML23
Improving codebook utilization and training stability
ICML22
Mitigating oversmoothing in VAEs
Neurocomputing
DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
EMNLP25
CARE: Assessing the Impact of Multilingual Human Preference Learning on Cultural Awareness
EMNLP25
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
ICCV25 MRR Workshop
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
TMLR
Cross-Modal Learning for Music-to-Music-Video Description Generation
NAACL25 RepL4NLP Workshop
VinaBench: Benchmark for Faithful and Consistent Visual Narratives
CVPR25
OpenMU: Your Swiss Army Knife for Music Understanding
ISMIR2024 Late Breaking Demos
DiffuCOMET: Contextual Commonsense Knowledge Diffusion
ACL24
On the Language Encoder of Contrastive Cross-modal Models
ACL24
Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning
ACL24
PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives (Outstanding Paper Award)
ACL23
ComFact: A Benchmark for Linking Contextual Commonsense Knowledge
EMNLP22 Findings
Tool Calling For Music Post-Production
ICLR26
Automatic music mixing using a generative model of effect embeddings
ICASSP26
Automatic Music Sample Identification with Multi-Track Contrastive Learning
ICASSP26
Leveraging Whisper Embeddings for Audio-based Lyrics Matching
ICASSP26
Training Data Attribution
Large-Scale Training Data Attribution for Music Generative Models via Unlearning
NeurIPS25 Creative AI
Reductive, exclusionary, normalising: The limits of generative AI music
TISMIR
Can Large Language Models Predict Audio Effects Parameters from Natural Language?
WASPAA25
Vocal Effects Style Transfer
Inference-Time Optimisation for Vocal Effects Style Transfer using DiffVox
WASPAA25
SOTA Fx representation: Extracting instrument-wise audio effects representations from music mixtures
ISMIR25
Inference Time Optimization for Music Mastering Style Transfer
ISMIR25
Reverse Engineering of Music Mixing Graphs with Differentiable Processors and Iterative Pruning
JAES
Supervised contrastive learning from weakly-labeled audio segments for musical version matching
ICML25
Music Foundation Model as Generic Booster for Music Downstream Tasks
TMLR
DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions
DAFx25
VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression
ICASSP25
Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer
ICASSP25
Searching For Music Mixing Graphs: A Pruning Approach
DAFx24
Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data
DAFx24
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
IJCAI24
Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription
ICASSP24
VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance
ICASSP24
Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
ISMIR23
An Attention-based Approach To Hierarchical Multi-label Music Instrument Classification
ICASSP23
Unsupervised Vocal Dereverberation with Diffusion-based Generative Models
ICASSP23
Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects
ICASSP23
DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability
ICASSP23
Hierarchical Diffusion Models for Singing Voice Neural Vocoder
ICASSP23
Distortion Effect Removal
Distortion Audio Effects: Learning How to Recover the Clean Signal
ISMIR22
Automatic Music Mixing with Deep Learning and Out-of-Domain Data
ISMIR22
Music Source Separation with Deep Equilibrium Models
ICASSP22
Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks
ICASSP22
Robust One-Shot Singing Voice Conversion
Glenn Gould and Kanji Ishimaru 2021: A collaboration with AI Sound Separation after 60 years
VIRTUE: Visual-Interactive Text-Image Universal Embedder
ICLR26
CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
ACMMM25
TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models
ICCV25
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
CVPR25
MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
ICLR25
SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
ICLR25
Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
ICLR25
GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping
NeurIPS24
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond
ISMIR24
Hearing Anything Anywhere
CVPR24
BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
ICASSP24
Zero- and Few-shot Sound Event Localization and Detection
ICASSP24
STARSS23
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
NeurIPS23
Audio Restoration: ViT-AE
Extending Audio Masked Autoencoders Toward Audio Restoration
WASPAA23
Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement
INTERSPEECH23
CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
ICLR23
Sound Event Localization and Detection
Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training
ICASSP22