Progress in multimodal foundation models has been astonishing in the past few years, allowing robots to be equipped with world knowledge of scenes, objects, and human activities. Robots should then be able to perceive and act upon the sensed world, yet current solutions require data diversity, task circumstances, and the label vocabulary all to be pre-defined, stationary, and controlled. As soon as these ‘closed world’ deep learning assumptions are broken, perceptual understanding suffers, oftentimes catastrophically. Hence, robots equipped with state-of-the-art multimodal perceptual skills will struggle to generalize to perception tasks in an open world, where sensory and semantic conditions differ considerably from those encountered during training.