TY - GEN
T1 - Retaining Semantics in Image to Music Conversion
AU - Xiong, Zeyu
AU - Lin, Pei Chun
AU - Farjudian, Amin
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - We propose a method for generating music from a given image through three stages of translation: from image to caption, caption to lyrics, and lyrics to instrumental music, which forms the content to be combined with a given style. We train our proposed model, which we call BGT (BLIP-GPT2-TeleMelody), on two open-source datasets, one containing over 200,000 labeled images, and another containing more than 175,000 MIDI music files. In contrast with pixel-level translation, the BGT model retains the semantics of the input image. We verify our claim through a user study in which participants were asked to match input images with generated music without access to the intermediate caption and lyrics. The results show that, while the matching rate among participants with low music expertise is essentially random, the rate among those with composition experience is significantly high, which strongly indicates that some semantic content of the input image is retained in the generated music.
KW - machine learning
KW - media composition
KW - media semantics
UR - http://www.scopus.com/inward/record.url?scp=85147542391&partnerID=8YFLogxK
U2 - 10.1109/ISM55400.2022.00051
DO - 10.1109/ISM55400.2022.00051
M3 - Conference contribution
AN - SCOPUS:85147542391
T3 - Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022
SP - 228
EP - 235
BT - Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 24th IEEE International Symposium on Multimedia, ISM 2022
Y2 - 5 December 2022 through 7 December 2022
ER -