Fine-grained knowledge about manipulable objects is well-predicted by contrastive language image pre-training
Blog Article
Summary: Object recognition is an important ability that relies on distinguishing between similar objects (e.g., deciding which utensil(s) to use at different stages of meal preparation). Recent work describes the fine-grained organization of knowledge about manipulable objects via the study of the constituent dimensions that are most relevant to human behavior, for example, vision-, manipulation-, and function-based properties.
A logical extension of this work concerns whether these dimensions are uniquely human or can be approximated by deep learning. Here, we show that behavioral dimensions are generally well-predicted by CLIP-ViT, a multimodal network trained on a large and diverse set of image-text pairs. Moreover, this model outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained object knowledge.
We discuss the possible sources of this benefit relative to other models (e.g., multimodal vs. image-only pre-training, dataset size, architecture).
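To give a rough sense of what "predicting behavioral dimensions from CLIP-ViT" can look like in practice, the sketch below extracts CLIP-ViT image embeddings for a handful of object images and relates them to human-derived dimension scores with leave-one-out ridge regression. The model checkpoint, file paths, placeholder ratings, and regression settings are illustrative assumptions, not the exact pipeline used in the study, which would involve many more objects and dimensions.

```python
# Minimal sketch: predict a human-derived behavioral dimension from CLIP-ViT
# image embeddings. All inputs below are placeholders for illustration only.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Load a contrastively trained vision-language model (CLIP-ViT).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_image_embedding(image_path: str) -> np.ndarray:
    """Return the CLIP image embedding for a single object image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features.squeeze(0).numpy()

# Hypothetical inputs: one image per manipulable object, plus human-derived
# scores on one behavioral dimension (e.g., a manipulation-related property).
image_paths = ["images/hammer.jpg", "images/whisk.jpg", "images/scissors.jpg"]
dimension_scores = np.array([0.82, 0.35, 0.61])  # placeholder ratings

X = np.stack([clip_image_embedding(p) for p in image_paths])
y = dimension_scores

# Leave-one-out ridge regression from embeddings to the dimension; the
# correlation between predicted and observed scores indexes how well the
# dimension is captured by the network's representations.
ridge = Ridge(alpha=1.0)
predicted = cross_val_predict(ridge, X, y, cv=LeaveOneOut())
r = np.corrcoef(predicted, y)[0, 1]
print(f"Prediction-behavior correlation: r = {r:.3f}")
```

The same procedure could be repeated for each behavioral dimension and for each comparison network (e.g., image-only models pre-trained on smaller datasets) to ask which model best approximates the human-derived structure.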