Abstract
Advances in computer vision enable recognition capabilities that surpass human levels, but human-like high-dimensional perceptual induction remains challenging and largely unsolved. Among the many forms of higher-level visual reasoning tasks, Raven’s Progressive Matrices (RPM) are one of the most representative ways to measure human reasoning capabilities. Neural networks have been used empirically to solve RPMs. However, current models are trained to reason from the positive answers only, so the negative answers do not assist reasoning. Moreover, reasoning is commonly achieved by mapping the 8 context panels to 1 answer, whereas RPMs are often solved by inducing consistent row-wise or column-wise rules. Finally, the perception modules often neglect global attributes, while the reasoning modules often seek appearance similarities rather than relational analogies.
To address these issues, we firstly propose Candidate Answer Morphological Mixup (CAM-Mix), a data augmentation method with pixelwise operations that convexly combines candidate answers, mixing the correct answers with negative answers. The generated samples are semantically similar to, but different from, the correct answers, and can therefore define an accurate boundary around the space spanned by the correct answers.
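The convex combination of a correct answer with a negative answer can be illustrated with a minimal sketch. It shows only plain pixelwise blending; the function name `convex_mix`, the mixing coefficient `lam`, and the toy images are assumptions made for illustration, and the morphological operations of CAM-Mix itself are not reproduced here.

```python
import numpy as np

def convex_mix(correct, negative, lam):
    """Pixelwise convex combination of two same-shaped images.

    `lam` close to 1 keeps the result near the correct answer, giving a
    sample that is similar to it but not identical (an assumed setting).
    """
    if correct.shape != negative.shape:
        raise ValueError("images must have the same shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    correct = correct.astype(np.float64)
    negative = negative.astype(np.float64)
    return lam * correct + (1.0 - lam) * negative

# Toy 4x4 grayscale "answers" with constant intensity.
correct = np.full((4, 4), 200, dtype=np.uint8)
negative = np.full((4, 4), 100, dtype=np.uint8)
mixed = convex_mix(correct, negative, lam=0.75)
```

In this toy case every pixel of `mixed` lies between the two source intensities, closer to the correct answer because `lam` is above 0.5.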
Secondly, we propose a Two-stage Rule-Induction Visual Reasoner (TRIVR) that handles complex and diversified visual reasoning tasks using a two-stage framework. In contrast to the existing “8+1” formulation, we formulate the problem as three groups of “2+1” image sets, one for each row of the problem panel, to uncover the consistent rules embedded in the three groups. Additionally, an RPM-like Video Prediction (RVP) dataset is constructed to evaluate the proposed method in real-world applications.
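The regrouping from “8+1” to three row-wise “2+1” sets can be sketched as a simple array reshape. This is only an illustration of the data layout, assuming the 8 context panels are stored in row-major order; the function name `to_row_groups` and the toy panels are hypothetical and not taken from the thesis.

```python
import numpy as np

def to_row_groups(context, candidate):
    """Arrange 8 context panels plus 1 candidate answer as 3 rows of 3 panels.

    context:   array of shape (8, H, W), assumed to be in row-major order
    candidate: array of shape (H, W), placed in the bottom-right slot
    returns:   array of shape (3, 3, H, W); in each row, the first two
               panels are the "2" and the last panel is the "+1"
    """
    if context.shape[0] != 8:
        raise ValueError("expected exactly 8 context panels")
    if context.shape[1:] != candidate.shape:
        raise ValueError("candidate must match the panel shape")
    panels = np.concatenate([context, candidate[None]], axis=0)  # (9, H, W)
    return panels.reshape(3, 3, *candidate.shape)

# Toy panels: panel i is a 2x2 image filled with the value i.
context = np.stack([np.full((2, 2), i) for i in range(8)])
candidate = np.full((2, 2), 8)
groups = to_row_groups(context, candidate)
```

Each `groups[r]` then holds one row, so the same rule-induction step can be applied to all three rows.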
Thirdly, a hybrid ConViT structure is proposed, in which a convolutional block captures the low-level visual attributes and an additional transformer block captures the high-level image semantics. Furthermore, we design an Attention-based Relational Reasoner (ARR) that dynamically learns the combination of rules applied to attributes across rows and columns. The element-wise attention in ARR better models the non-linear relations among image attributes, which uncovers a wider range of compositional relational rules in inductive reasoning.
Finally, a Hierarchical Perceptual and Predictive Analogical Inference (HP2AI) framework is proposed. It uses the first two feature sets of each row to predict the third feature set through a prediction network whose weights are shared across rows, so that it models the consistent underlying rules embedded in the three rows. Because numerous possible inferences could be drawn from three sets of features, the adopted SE blocks enhance the salience of the attributes and rules critical to solving the problem.
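The predict-then-verify idea can be sketched with a toy stand-in for the prediction network. The linear predictor, the "third = first + second" rule, the distance-based scoring, and all names (`predict_third`, `pick_answer`, `W`) are assumptions for illustration only; the actual framework uses a learned network with SE blocks, which is not reproduced here.

```python
import numpy as np

def predict_third(f1, f2, W):
    """Predict the third feature vector of a row from the first two.

    The same weight matrix W is used for every row, mirroring the idea of
    shared weights for a rule that is consistent across rows.
    """
    return np.concatenate([f1, f2], axis=-1) @ W

def pick_answer(f1, f2, candidate_feats, W):
    """Return the index of the candidate closest to the predicted feature."""
    pred = predict_third(f1, f2, W)
    dists = np.linalg.norm(candidate_feats - pred, axis=1)
    return int(np.argmin(dists))

# Toy rule: the third feature is the sum of the first two.
d = 3
W = np.vstack([np.eye(d), np.eye(d)])  # shape (2d, d)

# Features of the first two panels in the incomplete third row.
f1 = np.array([1.0, 0.0, 2.0])
f2 = np.array([0.5, 1.0, 1.0])

# Features of three hypothetical candidate answers.
candidates = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 1.0, 3.0],
    [3.0, 3.0, 3.0],
])
best = pick_answer(f1, f2, candidates, W)
```

Here the prediction step produces a feature for the missing panel, and the verification step selects the candidate whose feature agrees with it most closely.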
In summary, with the four proposed reasoning models, we have enhanced visual reasoning from different aspects: answer space modeling, rule induction, multi-scale and multi-level relational reasoning, and analogical inference by predicting and verifying. We have achieved new state-of-the-art visual reasoning performance on public benchmarks, improving on the existing baseline performance of ∼50% to reach the current state of the art of ∼98%, which substantially surpasses human-level performance.
Date of Award | Nov 2024 |
---|---|
Original language | English |
Awarding Institution | |
Supervisor | Jianfeng Ren (Supervisor), Ruibin Bai (Supervisor) & Tieyan Liu (Supervisor) |
Keywords
- Abstract Visual Reasoning
- Raven's Progressive Matrix
- Intelligence Quotient Test
- Transformer
- Attention