Fine-grained knowledge about manipulable objects is well-predicted by contrastive language image pre-training
Blog Article
Summary: Object recognition is an important ability that relies on distinguishing between similar objects (e.g., deciding which utensil(s) to use at different stages of meal preparation). Recent work describes the fine-grained organization of knowledge about manipulable objects via the study of the constituent dimensions that are most relevant to human behavior, for example, vision-, manipulation-, and function-based properties.
A logical extension of this work concerns whether these dimensions are uniquely human or can be approximated by deep learning. Here, we show that behavioral dimensions are generally well-predicted by CLIP-ViT, a multimodal network trained on a large and diverse set of image-text pairs. Moreover, this model outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained object knowledge.
We discuss the possible sources of this benefit relative to other models (e.g., multimodal vs. image-only pre-training, dataset size, architecture).
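To give a rough sense of what "predicting behavioral dimensions from CLIP-ViT" can look like in practice, the sketch below extracts CLIP-ViT image embeddings for a handful of object images and relates them to human-derived dimension scores with leave-one-out ridge regression. The model checkpoint, file paths, placeholder ratings, and regression settings are illustrative assumptions, not the exact pipeline used in the study, which would involve many more objects and dimensions.

```python
# Minimal sketch: predict a human-derived behavioral dimension from CLIP-ViT
# image embeddings. All inputs below are placeholders for illustration only.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Load a contrastively trained vision-language model (CLIP-ViT).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_image_embedding(image_path: str) -> np.ndarray:
    """Return the CLIP image embedding for a single object image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features.squeeze(0).numpy()

# Hypothetical inputs: one image per manipulable object, plus human-derived
# scores on one behavioral dimension (e.g., a manipulation-related property).
image_paths = ["images/hammer.jpg", "images/whisk.jpg", "images/scissors.jpg"]
dimension_scores = np.array([0.82, 0.35, 0.61])  # placeholder ratings

X = np.stack([clip_image_embedding(p) for p in image_paths])
y = dimension_scores

# Leave-one-out ridge regression from embeddings to the dimension; the
# correlation between predicted and observed scores indexes how well the
# dimension is captured by the network's representations.
ridge = Ridge(alpha=1.0)
predicted = cross_val_predict(ridge, X, y, cv=LeaveOneOut())
r = np.corrcoef(predicted, y)[0, 1]
print(f"Prediction-behavior correlation: r = {r:.3f}")
```

The same procedure could be repeated for each behavioral dimension and for each comparison network (e.g., image-only models pre-trained on smaller datasets) to ask which model best approximates the human-derived structure.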