Multimodal capabilities

The LLM Mesh provides multimodal capabilities to handle:

  • Image inputs. Images can be mixed with text in LLM queries, in order to answer requests such as “please describe what is in this image”.

  • Image outputs. The LLM Mesh can generate images.

Image input

Supported models

Multimodal input with images is supported with the following providers:

  • OpenAI (GPT-4o)

  • Azure OpenAI (GPT-4o)

  • Google Vertex (Gemini Pro)

  • Bedrock Claude 3

  • Bedrock Claude 3.5

  • Local HuggingFace models

API

Image input is available through the LLM Mesh API.

For more details, please see LLM Mesh.
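
As an illustration, the snippet below sketches how an image and a text prompt can be combined in a single query with the LLM Mesh Python API. The LLM id and the image path are placeholders, and the exact method names may vary across Dataiku versions; refer to the LLM Mesh API reference for the authoritative signatures.

    import dataiku

    # Get a handle on a multimodal-capable LLM declared in the LLM Mesh
    # ("openai:my-openai-connection:gpt-4o" is a placeholder id)
    client = dataiku.api_client()
    project = client.get_default_project()
    llm = project.get_llm("openai:my-openai-connection:gpt-4o")

    # Build a completion mixing text and an inline image
    completion = llm.new_completion()
    with open("photo.jpg", "rb") as f:
        image_bytes = f.read()

    message = completion.new_multipart_message()
    message.with_text("Please describe what is in this image")
    message.with_inline_image(image_bytes)
    message.add()

    response = completion.execute()
    if response.success:
        print(response.text)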

Answers

Image input is available in Dataiku Answers.

Image output

Supported models

Image generation is supported with the following providers:

  • OpenAI (DALL-E 3)

  • Azure OpenAI (DALL-E 3)

  • Google Vertex (Imagen 3 and Imagen 3 Fast)

  • Stability AI

  • Bedrock Titan Image Generator

  • Bedrock Stability AI models

  • Local HuggingFace models

Image-to-image support

Bedrock Amazon Titan Image Generator G1 supports image-to-image with the following image editing modes:

  • Image-to-image with prompt

  • Image-to-image without prompt

  • Image-to-image inpainting

    • Black mask mode

Bedrock Stability AI SDXL 1.0 also supports image-to-image with the following image editing modes:

  • Image-to-image with prompt

  • Image-to-image inpainting

    • Black mask mode

    • Transparent original image mode

Stability AI Stable Diffusion 3 Large supports image-to-image mode with prompt.

The Stability AI connection also supports the CONTROLNET_SKETCH and CONTROLNET_STRUCTURE image editing modes.
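
For illustration, here is a minimal sketch of an image-to-image request with a prompt through the LLM Mesh Python API. The LLM id, file names, and the with_original_image helper (including its weight parameter) are indicative assumptions; check the LLM Mesh API reference for the exact parameters supported by your provider and Dataiku version.

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()
    # Placeholder id for an image model supporting image-to-image (e.g. Titan Image Generator)
    img_llm = project.get_llm("bedrock:my-bedrock-connection:amazon.titan-image-generator-v1")

    generation = img_llm.new_images_generation()
    generation.with_prompt("Turn this sketch into a watercolor painting")

    with open("sketch.png", "rb") as f:
        original = f.read()
    # Attach the source image; the weight (assumed parameter) controls how strongly
    # the original image constrains the generated result
    generation.with_original_image(original, weight=0.7)

    response = generation.execute()
    # first_image() is assumed to return the raw bytes of the first generated image
    with open("watercolor.png", "wb") as f:
        f.write(response.first_image())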

API

Image output is available through the LLM Mesh API. Note that some parameters are not supported by all providers and models.

For more details, please see LLM Mesh.
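
As an example, a basic text-to-image generation can be sketched as follows with the LLM Mesh Python API. The LLM id is a placeholder, and provider-specific parameters (image size, quality, number of images, ...) are omitted since they are not supported by all providers and models; method names may vary across Dataiku versions.

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()
    # Placeholder id for an image generation model configured in the LLM Mesh
    img_llm = project.get_llm("openai:my-openai-connection:dall-e-3")

    generation = img_llm.new_images_generation()
    generation.with_prompt("A lighthouse on a cliff at sunset, in watercolor style")
    # Provider-specific options (size, quality, ...) can also be set on the query,
    # but they are not supported by every provider/model

    response = generation.execute()
    # first_image() is assumed to return the raw bytes of the first generated image
    with open("lighthouse.png", "wb") as f:
        f.write(response.first_image())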