TY - GEN
T1 - Paper Folding Puzzles
T2 - 40th AAAI Conference on Artificial Intelligence, AAAI 2026
AU - Zhou, Dibin
AU - Xu, Yantao
AU - Huang, Zongming
AU - Yan, Zengwei
AU - Liu, Wenhao
AU - Miao, Yongwei
AU - Ren, Jianfeng
AU - Liu, Fuchang
N1 - Publisher Copyright:
© 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2026
Y1 - 2026
N2 - Multimodal Large Language Models (MLLMs) largely lag human-level performance on abstract visual reasoning (AVR), which requires models to infer latent rules from visual question sets and generalize them to novel scenarios. Most AVR benchmarks are constrained to narrow and repetitive 2D patterns, involving relatively simple spatial relationships and assessing limited dimensions of reasoning ability. Drawing inspiration from real-world paper folding challenges, we propose Paper Folding Puzzles (PFP), a rigorously designed benchmark specifically developed to assess spatial reasoning capabilities. It comprises 150K visual question-answering samples across five diverse tasks, ranging from basic 2D geometric reasoning to 3D spatial understanding. The developed benchmark dataset can be employed to assess core spatial reasoning abilities essential to human cognition, encompassing fundamental symmetry reasoning and 3D spatial comprehension. Furthermore, we conduct a comprehensive evaluation of 18 leading MLLMs (both closed-and open-source vari-ants) on the PFP benchmark to assess their spatial reasoning capabilities. Our findings show that most MLLMs achieve near-chance performance on FPF, exhibiting substantial performance gaps (> 30%) relative to human baselines across all tasks. This highlights a critical research gap in improving spatial reasoning capabilities of MLLMs.
AB - Multimodal Large Language Models (MLLMs) largely lag human-level performance on abstract visual reasoning (AVR), which requires models to infer latent rules from visual question sets and generalize them to novel scenarios. Most AVR benchmarks are constrained to narrow and repetitive 2D patterns, involving relatively simple spatial relationships and assessing limited dimensions of reasoning ability. Drawing inspiration from real-world paper folding challenges, we propose Paper Folding Puzzles (PFP), a rigorously designed benchmark specifically developed to assess spatial reasoning capabilities. It comprises 150K visual question-answering samples across five diverse tasks, ranging from basic 2D geometric reasoning to 3D spatial understanding. The developed benchmark dataset can be employed to assess core spatial reasoning abilities essential to human cognition, encompassing fundamental symmetry reasoning and 3D spatial comprehension. Furthermore, we conduct a comprehensive evaluation of 18 leading MLLMs (both closed-and open-source vari-ants) on the PFP benchmark to assess their spatial reasoning capabilities. Our findings show that most MLLMs achieve near-chance performance on FPF, exhibiting substantial performance gaps (> 30%) relative to human baselines across all tasks. This highlights a critical research gap in improving spatial reasoning capabilities of MLLMs.
UR - https://www.scopus.com/pages/publications/105034998900
U2 - 10.1609/aaai.v40i16.38364
DO - 10.1609/aaai.v40i16.38364
M3 - Conference contribution
AN - SCOPUS:105034998900
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 13584
EP - 13592
BT - Proceedings of the AAAI Conference on Artificial Intelligence
A2 - Koenig, Sven
A2 - Jenkins, Chad
A2 - Taylor, Matthew E.
PB - Association for the Advancement of Artificial Intelligence
Y2 - 20 January 2026 through 27 January 2026
ER -