Retaining Semantics in Image to Music Conversion

Zeyu Xiong; Pei Chun Lin; Amin Farjudian

doi:10.1109/ISM55400.2022.00051

School of Computer Science

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

We propose a method for generating music from a given image through three stages of translation, from image to caption, caption to lyrics, and lyrics to instrumental music, which forms the content to be combined with a given style. We train our proposed model, which we call BGT (BLIP-GPT2-TeleMelody), on two open-source datasets, one containing over 200,000 labeled images, and another containing more than 175,000 MIDI music files. In contrast with pixel level translation, the BGT model retains the semantics of the input image. We verify our claim through a user study in which participants were asked to match input images with generated music without access to the intermediate caption and lyrics. The results show that, while the matching rate among participants with low music expertise is essentially random, the rate among those with composition experience is significantly high, which strongly indicates that some semantic content of the input image is retained in the generated music.
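The three-stage translation described in the abstract (image to caption, caption to lyrics, lyrics to music) can be pictured as a simple composition of functions. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Pipeline` class, the stub stage functions, and the toy mapping from word length to MIDI pitch are all hypothetical stand-ins for the BLIP, GPT-2, and TeleMelody stages, and the style-combination step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    """Chains three translation stages; each stage is a pluggable callable."""
    image_to_caption: Callable[[bytes], str]
    caption_to_lyrics: Callable[[str], str]
    lyrics_to_music: Callable[[str], List[int]]

    def run(self, image: bytes) -> List[int]:
        caption = self.image_to_caption(image)   # stage 1: image -> caption
        lyrics = self.caption_to_lyrics(caption)  # stage 2: caption -> lyrics
        return self.lyrics_to_music(lyrics)       # stage 3: lyrics -> notes


# Hypothetical stubs standing in for the real models (assumptions only).
def stub_caption(image: bytes) -> str:
    return "a dog on a beach"


def stub_lyrics(caption: str) -> str:
    return caption + " under the sun"


def stub_music(lyrics: str) -> List[int]:
    # Toy rule: one MIDI pitch per word, offset from middle C by word length.
    return [60 + len(word) % 12 for word in lyrics.split()]


pipeline = Pipeline(stub_caption, stub_lyrics, stub_music)
notes = pipeline.run(b"raw image bytes")
```

Because only text passes between stages, the output depends on the caption's content rather than on pixel values, which is the sense in which the abstract contrasts the approach with pixel-level translation.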

Original language	English
Title of host publication	Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	228-235
Number of pages	8
ISBN (Electronic)	9781665471725
DOIs	https://doi.org/10.1109/ISM55400.2022.00051
Publication status	Published - 2022
Event	24th IEEE International Symposium on Multimedia, ISM 2022 - Virtual, Online, Italy Duration: 5 Dec 2022 → 7 Dec 2022

Publication series

Name	Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022

Conference

Conference	24th IEEE International Symposium on Multimedia, ISM 2022
Country/Territory	Italy
City	Virtual, Online
Period	5/12/22 → 7/12/22

Keywords

machine learning
media composition
media semantics

ASJC Scopus subject areas

Artificial Intelligence
Computer Science Applications
Signal Processing
Media Technology

Access to Document

10.1109/ISM55400.2022.00051

Cite this

@inproceedings{0685a6e843b54e90b777acffa3750b26,
  title = "Retaining Semantics in Image to Music Conversion",
  abstract = "We propose a method for generating music from a given image through three stages of translation, from image to caption, caption to lyrics, and lyrics to instrumental music, which forms the content to be combined with a given style. We train our proposed model, which we call BGT (BLIP-GPT2-TeleMelody), on two open-source datasets, one containing over 200,000 labeled images, and another containing more than 175,000 MIDI music files. In contrast with pixel level translation, the BGT model retains the semantics of the input image. We verify our claim through a user study in which participants were asked to match input images with generated music without access to the intermediate caption and lyrics. The results show that, while the matching rate among participants with low music expertise is essentially random, the rate among those with composition experience is significantly high, which strongly indicates that some semantic content of the input image is retained in the generated music.",
  keywords = "machine learning, media composition, media semantics",
  author = "Zeyu Xiong and Lin, {Pei Chun} and Amin Farjudian",
  note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 24th IEEE International Symposium on Multimedia, ISM 2022 ; Conference date: 05-12-2022 Through 07-12-2022",
  year = "2022",
  doi = "10.1109/ISM55400.2022.00051",
  language = "English",
  series = "Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022",
  publisher = "Institute of Electrical and Electronics Engineers Inc.",
  pages = "228--235",
  booktitle = "Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022",
  address = "United States",
}

Xiong, Z, Lin, PC & Farjudian, A 2022, Retaining Semantics in Image to Music Conversion. in Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022. Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022, Institute of Electrical and Electronics Engineers Inc., pp. 228-235, 24th IEEE International Symposium on Multimedia, ISM 2022, Virtual, Online, Italy, 5/12/22. https://doi.org/10.1109/ISM55400.2022.00051

Retaining Semantics in Image to Music Conversion. / Xiong, Zeyu; Lin, Pei Chun; Farjudian, Amin.
Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022. Institute of Electrical and Electronics Engineers Inc., 2022. p. 228-235 (Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022).

TY  - GEN
T1  - Retaining Semantics in Image to Music Conversion
AU  - Xiong, Zeyu
AU  - Lin, Pei Chun
AU  - Farjudian, Amin
PY  - 2022
Y1  - 2022
AB  - We propose a method for generating music from a given image through three stages of translation, from image to caption, caption to lyrics, and lyrics to instrumental music, which forms the content to be combined with a given style. We train our proposed model, which we call BGT (BLIP-GPT2-TeleMelody), on two open-source datasets, one containing over 200,000 labeled images, and another containing more than 175,000 MIDI music files. In contrast with pixel level translation, the BGT model retains the semantics of the input image. We verify our claim through a user study in which participants were asked to match input images with generated music without access to the intermediate caption and lyrics. The results show that, while the matching rate among participants with low music expertise is essentially random, the rate among those with composition experience is significantly high, which strongly indicates that some semantic content of the input image is retained in the generated music.
KW  - machine learning
KW  - media composition
KW  - media semantics
UR  - http://www.scopus.com/inward/record.url?scp=85147542391&partnerID=8YFLogxK
U2  - 10.1109/ISM55400.2022.00051
DO  - 10.1109/ISM55400.2022.00051
M3  - Conference contribution
AN  - SCOPUS:85147542391
T3  - Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022
SP  - 228
EP  - 235
BT  - Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022
PB  - Institute of Electrical and Electronics Engineers Inc.
T2  - 24th IEEE International Symposium on Multimedia, ISM 2022
Y2  - 5 December 2022 through 7 December 2022
ER  -
