Build Tool Plugins for Multimodal Data Processing in Knowledge Pipelines

In knowledge pipelines, the Knowledge Base node accepts two multimodal input formats: multimodal-Parent-Child and multimodal-General. For the Knowledge Base node to correctly recognize and embed a tool plugin’s multimodal output (text, images, audio, video, and so on), the plugin needs two pieces of configuration:

1. In the tool code file, call the tool session interface to upload files and construct the files object.
2. In the tool provider YAML file, declare the output_schema as either multimodal-Parent-Child or multimodal-General.

Upload Files and Construct File Objects

When processing multimodal data (such as images), first upload the file through Dify’s tool session interface to obtain its metadata. The following example, taken from the official Dify plugin Dify Extractor, shows how to upload a file and reference it by its preview URL.

# Upload the file using the tool session
file_res = self._tool.session.file.upload(
    file_name,   # filename
    file_blob,   # file binary data
    mime_type,   # MIME type, e.g., "image/png"
)

# Generate a Markdown image reference using the file preview URL
image_url = f"![image]({file_res.preview_url})"
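
The three upload arguments can be prepared with the standard library alone. The sketch below is an illustration rather than code from Dify Extractor: `prepare_upload_args` is a hypothetical helper, and falling back to `application/octet-stream` for unknown extensions is an assumption.

```python
import mimetypes
from pathlib import Path


def prepare_upload_args(path: str, blob: bytes) -> tuple[str, bytes, str]:
    """Return (file_name, file_blob, mime_type) for the upload call.

    Hypothetical helper: the MIME type is guessed from the file extension,
    and unknown extensions fall back to a generic binary type (assumption).
    """
    file_name = Path(path).name
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return file_name, blob, mime_type or "application/octet-stream"


file_name, file_blob, mime_type = prepare_upload_args("figures/chart.png", b"\x89PNG")
print(file_name, mime_type)
```

Guessing from the extension is cheap but not authoritative; if the source document already records a MIME type for the embedded file, passing that value through is likely the safer choice.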

The upload interface returns an UploadFileResponse object containing the file information. Its structure is as follows:

    from enum import Enum
    from pydantic import BaseModel

    class UploadFileResponse(BaseModel):
        class Type(str, Enum):
            DOCUMENT = "document"
            IMAGE = "image"
            VIDEO = "video"
            AUDIO = "audio"

            @classmethod
            def from_mime_type(cls, mime_type: str):
                if mime_type.startswith("image/"):
                    return cls.IMAGE
                if mime_type.startswith("video/"):
                    return cls.VIDEO
                if mime_type.startswith("audio/"):
                    return cls.AUDIO
                return cls.DOCUMENT

        id: str
        name: str
        size: int
        extension: str
        mime_type: str
        type: Type | None = None
        preview_url: str | None = None
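
The effect of from_mime_type is easiest to see on a few inputs. The following sketch mirrors only the nested Type enum using the standard library, so it runs without pydantic or the plugin SDK; `FileType` is a stand-in name chosen for this example, not the SDK class.

```python
from enum import Enum


class FileType(str, Enum):
    """Stand-in for UploadFileResponse.Type (same values, same rules)."""

    DOCUMENT = "document"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"

    @classmethod
    def from_mime_type(cls, mime_type: str) -> "FileType":
        # Match on the MIME top-level type; anything else is a document.
        prefixes = (("image/", cls.IMAGE), ("video/", cls.VIDEO), ("audio/", cls.AUDIO))
        for prefix, member in prefixes:
            if mime_type.startswith(prefix):
                return member
        return cls.DOCUMENT


for mime in ("image/png", "audio/mpeg", "application/pdf", "text/plain"):
    print(mime, "->", FileType.from_mime_type(mime).value)
```

Any MIME type that does not start with image/, video/, or audio/ (including text/plain and application/pdf) falls through to document.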

Map the returned file information (name, size, extension, mime_type, and so on) to the files field of the multimodal output structure. The multimodal-Parent-Child schema, which defines that field, is as follows:

{
    "$id": "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "version": "1.0.0",
    "type": "object",
    "title": "Multimodal Parent-Child Structure",
    "description": "Schema for multimodal parent-child structure (v1)",
    "properties": {
        "parent_mode": {
            "type": "string",
            "description": "The mode of parent-child relationship"
        },
        "parent_child_chunks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "parent_content": {
                        "type": "string",
                        "description": "The parent content"
                    },
                    "files": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {
                                    "type": "string",
                                    "description": "file name"
                                },
                                "size": {
                                    "type": "number",
                                    "description": "file size"
                                },
                                "extension": {
                                    "type": "string",
                                    "description": "file extension"
                                },
                                "type": {
                                    "type": "string",
                                    "description": "file type"
                                },
                                "mime_type": {
                                    "type": "string",
                                    "description": "file mime type"
                                },
                                "transfer_method": {
                                    "type": "string",
                                    "description": "file transfer method"
                                },
                                "url": {
                                    "type": "string",
                                    "description": "file url"
                                },
                                "related_id": {
                                    "type": "string",
                                    "description": "file related id"
                                }
                            },
                            "required": ["name", "size", "extension", "type", "mime_type", "transfer_method", "url", "related_id"]
                        },
                        "description": "List of files"
                    },
                    "child_contents": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        },
                        "description": "List of child contents"
                    }
                },
                "required": ["parent_content", "child_contents"]
            },
            "description": "List of parent-child chunk pairs"
        }
    },
    "required": ["parent_mode", "parent_child_chunks"]
}
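
The mapping from upload metadata to a files entry can be sketched as follows. Several details here are assumptions rather than documented behavior: `UploadedFile` is a stand-in dataclass for UploadFileResponse, `build_file_entry` and `build_result` are hypothetical helpers, url is filled from preview_url, related_id is filled from id, and the transfer_method and parent_mode values are placeholders to verify against your Dify version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UploadedFile:
    """Stand-in for UploadFileResponse so the sketch runs without the SDK."""

    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Optional[str] = None
    preview_url: Optional[str] = None


def build_file_entry(upload: UploadedFile, transfer_method: str) -> dict:
    """Map upload metadata onto the eight required keys of a files item."""
    return {
        "name": upload.name,
        "size": upload.size,
        "extension": upload.extension,
        "type": upload.type or "document",
        "mime_type": upload.mime_type,
        "transfer_method": transfer_method,  # assumption: caller supplies the value
        "url": upload.preview_url or "",     # assumption: preview URL used as file URL
        "related_id": upload.id,             # assumption: upload id used as related id
    }


def build_result(parent_mode: str, parent_content: str,
                 child_contents: list, files: list) -> dict:
    """Assemble a result holding a single parent-child chunk."""
    chunk = {
        "parent_content": parent_content,
        "child_contents": child_contents,
        "files": files,
    }
    return {"parent_mode": parent_mode, "parent_child_chunks": [chunk]}


upload = UploadedFile(
    id="file-123", name="chart.png", size=2048, extension=".png",
    mime_type="image/png", type="image",
    preview_url="https://example.com/files/file-123/preview",
)
result = build_result(
    parent_mode="paragraph",  # placeholder value (assumption)
    parent_content="Quarterly revenue overview with chart.",
    child_contents=["Quarterly revenue overview", "with chart."],
    files=[build_file_entry(upload, transfer_method="tool_file")],  # placeholder value
)
print(result["parent_child_chunks"][0]["files"][0]["name"])
```

Keeping the mapping in one helper means that any correction to the assumed fields (for example, a different source for url) only has to be made in one place.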

Declare Multimodal Output Structure

The structure of multimodal data is defined by Dify’s official JSON schemas. For the Knowledge Base node to recognize the plugin’s multimodal output type, point the result field under output_schema in the plugin’s provider YAML file to the corresponding official schema URL with $ref.

output_schema:
  type: object
  properties:
    result:
      # multimodal-Parent-Child
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
      
      # multimodal-General
      # $ref: "https://dify.ai/schemas/v1/multimodal_general_structure.json"

Taking multimodal-Parent-Child as an example, a complete YAML configuration is as follows:

identity:
  name: multimodal_tool
  author: langgenius
  label:
    en_US: multimodal tool
    zh_Hans: 多模态提取器
    pt_BR: multimodal tool
description:
  human:
    en_US: Process documents into multimodal-Parent-Child chunk structures
    zh_Hans: 将文档处理为多模态父子分块结构
    pt_BR: Processar documentos em estruturas de divisão pai-filho
  llm: Processes documents into hierarchical multimodal-Parent-Child chunk structures

parameters:
  - name: input_text
    human_description:
      en_US: The text you want to chunk.
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    label:
      en_US: Input Content
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    llm_description: The text you want to chunk.
    required: true
    type: string
    form: llm

output_schema:
  type: object
  properties:
    result:
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
extra:
  python:
    source: tools/parent_child_chunk.py
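
Before a tool returns its result, a quick structural check against the required lists in the multimodal-Parent-Child schema can catch missing keys early. This is a hand-written sketch using only the standard library, not an official validator: `missing_required` is a hypothetical helper, the parent_mode value is a placeholder, and a full JSON Schema validator would also check value types.

```python
REQUIRED_TOP = ("parent_mode", "parent_child_chunks")
REQUIRED_CHUNK = ("parent_content", "child_contents")
REQUIRED_FILE = (
    "name", "size", "extension", "type",
    "mime_type", "transfer_method", "url", "related_id",
)


def missing_required(result: dict) -> list[str]:
    """Return the paths of required keys absent from a Parent-Child result."""
    problems = [key for key in REQUIRED_TOP if key not in result]
    for i, chunk in enumerate(result.get("parent_child_chunks", [])):
        prefix = f"parent_child_chunks[{i}]"
        problems += [f"{prefix}.{key}" for key in REQUIRED_CHUNK if key not in chunk]
        # "files" itself is optional, but each listed file needs all eight keys.
        for j, file in enumerate(chunk.get("files", [])):
            problems += [
                f"{prefix}.files[{j}].{key}"
                for key in REQUIRED_FILE
                if key not in file
            ]
    return problems


sample = {
    "parent_mode": "paragraph",  # placeholder value (assumption)
    "parent_child_chunks": [
        {
            "parent_content": "Intro",
            "child_contents": ["Intro"],
            "files": [{"name": "a.png"}],  # deliberately incomplete
        },
    ],
}
for path in missing_required(sample):
    print("missing:", path)
```

An empty list means every required key is present; the check says nothing about whether the values themselves are valid.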
