Demo based on the example Qwen2.5-VL spatial notebook for detecting foods and drinks in images with bounding boxes. Upload an image of food or drink and the model will predict bounding boxes for the items it finds. If no food is present in the image, the model should return 'no foods found'.
One prediction will use thinking tags, e.g. ..., to try to describe what's in the image. The other will directly predict JSON containing bounding box coordinates and labels.
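As a rough illustration of how either kind of output could be turned into boxes, here is a minimal Python sketch. It is not the code from app.py. The function name parse_detections is hypothetical, and the key names "bbox_2d" and "label" are an assumption based on the convention commonly seen in Qwen2.5-VL grounding examples; the prompts in app.py may request a different schema. The sketch scans the reply for the first JSON array it can parse, so any free-text description before the JSON is skipped, and a reply such as 'no foods found' simply yields an empty list.

```python
import json


def parse_detections(raw_output):
    """Return a list of {"label", "box"} dicts, or [] if no usable JSON array is found.

    Assumes (not confirmed by the demo) that each item looks like
    {"bbox_2d": [x1, y1, x2, y2], "label": "..."}.
    """
    end = raw_output.rfind("]")
    start = raw_output.find("[")
    # Try each "[" in turn until the text from there to the last "]" parses as a list.
    while start != -1 and start < end:
        try:
            items = json.loads(raw_output[start:end + 1])
        except json.JSONDecodeError:
            items = None
        if isinstance(items, list):
            detections = []
            for item in items:
                if not isinstance(item, dict):
                    continue
                box = item.get("bbox_2d")   # assumed key name
                label = item.get("label")   # assumed key name
                if (
                    isinstance(label, str)
                    and isinstance(box, list)
                    and len(box) == 4
                    and all(isinstance(v, (int, float)) for v in box)
                ):
                    detections.append({"label": label, "box": [float(v) for v in box]})
            return detections
        start = raw_output.find("[", start + 1)
    return []


if __name__ == "__main__":
    reply = 'A plate of fruit. [{"bbox_2d": [10, 20, 110, 220], "label": "apple"}]'
    print(parse_detections(reply))
    print(parse_detections("no foods found"))
```

Returning an empty list for both the 'no foods found' reply and malformed output keeps the drawing step simple, at the cost of not distinguishing the two cases.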
Boxes may not be as accurate as those from a dedicated object detection model, but the benefit here is that they are class-agnostic (i.e. the model can detect a wide range of items despite never being explicitly trained on them).
The foundation knowledge in Qwen2.5-VL (this demo uses Qwen2.5-VL-7B-Instruct) means it can detect a wide range of foods and drinks.
See the app.py file for the different prompts used.