In knowledge pipelines, the Knowledge Base node supports input in two multimodal data formats: `multimodal-Parent-Child` and `multimodal-General`.
When developing a tool plugin for multimodal data processing, to ensure that the plugin's multimodal output (such as text, images, audio, or video) can be correctly recognized and embedded by the Knowledge Base node, you need to complete the following configuration:
- In the tool code file, call the tool session interface to upload files and construct the `files` object.
- In the tool provider YAML file, declare the `output_schema` as either `multimodal-Parent-Child` or `multimodal-General`.
## Upload Files and Construct File Objects
When processing multimodal data (such as images), you first need to upload the file through Dify's tool session interface to obtain the file metadata. The following example uses the official Dify plugin, Dify Extractor, to demonstrate how to upload a file and construct a `files` object.
The upload returns an `UploadFileResponse` object containing the file information. Pass its metadata fields (such as `name`, `size`, `extension`, and `mime_type`) to the `files` field in the multimodal output structure.
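As a rough sketch of that mapping step, the snippet below models the upload response as a plain dataclass and copies its metadata into one entry of the `files` field. The field names (`name`, `size`, `extension`, `mime_type`, `url`) and the helper `build_files_entry` are illustrative assumptions, not the SDK's actual definitions; consult the `UploadFileResponse` structure above for the authoritative shape.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the SDK's UploadFileResponse;
# real field names and types may differ.
@dataclass
class UploadFileResponse:
    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    url: str


def build_files_entry(resp: UploadFileResponse) -> dict:
    """Copy the upload metadata into one entry of the `files` field
    of the multimodal output structure (field set assumed)."""
    return {
        "name": resp.name,
        "size": resp.size,
        "extension": resp.extension,
        "mime_type": resp.mime_type,
        "url": resp.url,
    }


# Example: after uploading via the tool session, map the response
# into the `files` list that the tool output will carry.
resp = UploadFileResponse(
    id="f-1",
    name="report.pdf",
    size=2048,
    extension=".pdf",
    mime_type="application/pdf",
    url="https://example.com/files/f-1",
)
files = [build_files_entry(resp)]
print(files[0]["mime_type"])  # application/pdf
```

The key point is that the `files` entries must carry the metadata the Knowledge Base node expects; the upload response is the source of truth for those values.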
## Declare Multimodal Output Structure
The structure of multimodal data is defined by Dify's official JSON schema. To enable the Knowledge Base node to recognize the plugin's multimodal output type, you need to point the `result` field under `output_schema` in the plugin's provider YAML file to the corresponding official schema URL.
Taking `multimodal-Parent-Child` as an example, a complete YAML configuration is as follows: