
"The past decade marked a new era of Computer Vision 2.0, boldly supported by the power, flexibility, and efficiency of Deep Learning. With the arrival of Large Language Models, however, we may soon be entering a third era of Computer Vision 3.0, with algorithms that interact and reason in the physical world rather than on desktop computers, powered by foundational components trained on very large-scale data. Large language models have been so successful, in my opinion, because of their scale and efficiency: i) they were able to bootstrap on the structure already present in textual data, ii) they were able to interact through dialogue, and iii) they were able to operate in near-open-world settings. Importantly, their success lies not in their capacity for accurate prediction but in their uncanny capacity for generalization. In contrast to language, structure in vision is also present in the data but remains latent; it must be inferred, and thus priors must be explicitly included. In this talk, I will make a case for dynamics, causality, geometry, and physical simulation as the necessary priors for successfully learning foundational embodied AI algorithms, and present recent works in this direction.
"
Associate Professor at University of Amsterdam