Progress in multimodal foundation models has been astonishing in the past few years, making it possible to equip robots with world knowledge of scenes, objects, and human activities. Robots should then be able to perceive and act upon the sensed world, yet current solutions require the data diversity, task circumstances, and label vocabulary to all be pre-defined, stationary, and controlled. As soon as these ‘closed world’ deep learning assumptions are broken, perceptual understanding suffers, often catastrophically.