# Multimodal Tool

> Configure a tool plugin to emit images, audio, or video so the Knowledge Base node can embed multimodal outputs alongside text

In knowledge pipelines, the Knowledge Base node supports input in two multimodal data formats: `multimodal-Parent-Child` and `multimodal-General`.

For the Knowledge Base node to recognize and embed a tool plugin's multimodal output (such as text, images, audio, or video), configure the plugin in two places:

* **In the tool code file**: Call the tool session interface to upload files and construct the `files` object.
* **In the tool provider YAML file**: Declare the `output_schema` as either `multimodal-Parent-Child` or `multimodal-General`.

## Upload Files and Construct File Objects

When processing multimodal data such as images, first upload the file through Dify's tool session to obtain the file metadata.

The following example, taken from the official **Dify Extractor** plugin, shows how to upload a file and construct a `files` object.

```python theme={null}
# Upload the file using the tool session
file_res = self._tool.session.file.upload(
    file_name,   # filename
    file_blob,   # file binary data
    mime_type,   # MIME type, e.g., "image/png"
)

# Generate a Markdown image reference using the file preview URL
image_url = f"![image]({file_res.preview_url})"
```
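
The upload call above expects a filename, the raw bytes, and a MIME type. As a minimal sketch of preparing those arguments, the hypothetical helpers below (`prepare_upload_args` and `markdown_image` are illustrative names, not part of the Dify SDK) guess the MIME type with Python's standard `mimetypes` module and format the Markdown reference from a preview URL. The fallback to `application/octet-stream` is an assumption, not documented Dify behavior.

```python
import mimetypes
from pathlib import Path


def prepare_upload_args(path: str, blob: bytes) -> tuple[str, bytes, str]:
    """Return (file_name, file_blob, mime_type) for an upload call.

    Hypothetical helper: the MIME type is guessed from the file extension.
    """
    file_name = Path(path).name
    mime_type, _encoding = mimetypes.guess_type(file_name)
    # Assumption: use a generic binary type when the extension is unknown.
    return file_name, blob, mime_type or "application/octet-stream"


def markdown_image(preview_url: str, alt: str = "image") -> str:
    """Format a Markdown image reference from a preview URL."""
    return f"![{alt}]({preview_url})"


file_name, file_blob, mime_type = prepare_upload_args("docs/figure.png", b"\x89PNG")
print(file_name, mime_type)
print(markdown_image("https://example.com/preview/abc"))
```

Inside a tool, the three returned values would be passed to the upload call in the same order as in the example above.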

The upload interface returns an `UploadFileResponse` object containing the file information:

```python theme={null}
from enum import Enum
from pydantic import BaseModel

class UploadFileResponse(BaseModel):
    class Type(str, Enum):
        DOCUMENT = "document"
        IMAGE = "image"
        VIDEO = "video"
        AUDIO = "audio"

        @classmethod
        def from_mime_type(cls, mime_type: str):
            if mime_type.startswith("image/"):
                return cls.IMAGE
            if mime_type.startswith("video/"):
                return cls.VIDEO
            if mime_type.startswith("audio/"):
                return cls.AUDIO
            return cls.DOCUMENT

    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Type | None = None
    preview_url: str | None = None
```
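
Because `type` is optional in the response while the multimodal schemas expect a `type` string for each file, a tool may need to derive it when it is missing. The sketch below mirrors the prefix rules of `from_mime_type` in a standalone function. `resolve_file_type` is a hypothetical name, and falling back to the MIME type when `type` is `None` is an assumption about how a plugin might handle that case.

```python
from typing import Optional


def resolve_file_type(declared: Optional[str], mime_type: str) -> str:
    """Return the declared type, or derive one from the MIME type prefix."""
    if declared:
        return declared
    # Same prefix rules as UploadFileResponse.Type.from_mime_type.
    for prefix, file_type in (("image/", "image"), ("video/", "video"), ("audio/", "audio")):
        if mime_type.startswith(prefix):
            return file_type
    # Anything else is treated as a document.
    return "document"


for declared, mime in [(None, "image/png"), (None, "application/pdf"), ("video", "video/mp4")]:
    print(mime, "->", resolve_file_type(declared, mime))
```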

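Map the file information from the response to an entry in the `files` array of the multimodal output structure. Each entry carries `name`, `size`, `extension`, `type`, `mime_type`, `transfer_method`, `url`, and `related_id`; the parent-child schema below marks all eight as required.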

<CodeGroup>
  ```json multimodal_parent_child_structure highlight={22-62} expandable theme={null}
  {
      "$id": "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json",
      "$schema": "http://json-schema.org/draft-07/schema#",
      "version": "1.0.0",
      "type": "object",
      "title": "Multimodal Parent-Child Structure",
      "description": "Schema for multimodal parent-child structure (v1)",
      "properties": {
          "parent_mode": {
              "type": "string",
              "description": "The mode of parent-child relationship"
          },
          "parent_child_chunks": {
              "type": "array",
              "items": {
                  "type": "object",
                  "properties": {
                      "parent_content": {
                          "type": "string",
                          "description": "The parent content"
                      },
                      "files": {
                          "type": "array",
                          "items": {
                              "type": "object",
                              "properties": {
                                  "name": {
                                      "type": "string",
                                      "description": "file name"
                                  },
                                  "size": {
                                      "type": "number",
                                      "description": "file size"
                                  },
                                  "extension": {
                                      "type": "string",
                                      "description": "file extension"
                                  },
                                  "type": {
                                      "type": "string",
                                      "description": "file type"
                                  },
                                  "mime_type": {
                                      "type": "string",
                                      "description": "file mime type"
                                  },
                                  "transfer_method": {
                                      "type": "string",
                                      "description": "file transfer method"
                                  },
                                  "url": {
                                      "type": "string",
                                      "description": "file url"
                                  },
                                  "related_id": {
                                      "type": "string",
                                      "description": "file related id"
                                  }
                              },
                              "required": ["name", "size", "extension", "type", "mime_type", "transfer_method", "url", "related_id"]
                          },
                          "description": "List of files"
                      },
                      "child_contents": {
                          "type": "array",
                          "items": {
                              "type": "string"
                          },
                          "description": "List of child contents"
                      }
                  },
                  "required": ["parent_content", "child_contents"]
              },
              "description": "List of parent-child chunk pairs"
          }
      },
      "required": ["parent_mode", "parent_child_chunks"]
  }
  ```

  ```json multimodal_general_structure highlight={18-56} expandable theme={null}
  {
      "$id": "https://dify.ai/schemas/v1/multimodal_general_structure.json",
      "$schema": "http://json-schema.org/draft-07/schema#",
      "version": "1.0.0",
      "type": "array",
      "title": "Multimodal General Structure",
      "description": "Schema for multimodal general structure (v1) - array of objects",
      "properties": {
          "general_chunks": {
              "type": "array",
              "items": {
                  "type": "object",
                  "properties": {
                      "content": {
                          "type": "string",
                          "description": "The content"
                      },
                      "files": {
                          "type": "array",
                          "items": {
                              "type": "object",
                              "properties": {
                                  "name": {
                                      "type": "string",
                                      "description": "file name"
                                  },
                                  "size": {
                                      "type": "number",
                                      "description": "file size"
                                  },
                                  "extension": {
                                      "type": "string",
                                      "description": "file extension"
                                  },
                                  "type": {
                                      "type": "string",
                                      "description": "file type"
                                  },
                                  "mime_type": {
                                      "type": "string",
                                      "description": "file mime type"
                                  },
                                  "transfer_method": {
                                      "type": "string",
                                      "description": "file transfer method"
                                  },
                                  "url": {
                                      "type": "string",
                                      "description": "file url"
                                  },
                                  "related_id": {
                                      "type": "string",
                                      "description": "file related id"
                                  }
                              },
                              "description": "List of files"
                          }
                      }
                  },
                  "required": ["content"]
              },
              "description": "List of content and files"
          }
      }
  }
  ```
</CodeGroup>
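
The mapping from an upload response to a `files` entry can be sketched as follows. This is an illustration under stated assumptions, not the Dify Extractor implementation: `UploadedFile` is a hypothetical stand-in for `UploadFileResponse`, the choice of `preview_url` for `url` and `id` for `related_id` is assumed, and the `transfer_method` and `parent_mode` values are placeholders to be checked against your Dify version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UploadedFile:
    """Hypothetical stand-in for the UploadFileResponse fields used here."""
    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Optional[str] = None
    preview_url: Optional[str] = None


def to_file_entry(uploaded: UploadedFile, transfer_method: str) -> dict:
    """Build one `files` item with the eight fields listed in the schema."""
    return {
        "name": uploaded.name,
        "size": uploaded.size,
        "extension": uploaded.extension,
        # Assumption: fall back to "document" when the response has no type.
        "type": uploaded.type or "document",
        "mime_type": uploaded.mime_type,
        # Assumption: the caller supplies a transfer method Dify accepts.
        "transfer_method": transfer_method,
        # Assumption: the preview URL serves as the file URL.
        "url": uploaded.preview_url or "",
        # Assumption: the upload id serves as the related id.
        "related_id": uploaded.id,
    }


def build_parent_child_result(parent_mode, parent_content, child_contents, files):
    """Wrap one parent chunk, its children, and its files in the output shape."""
    return {
        "parent_mode": parent_mode,
        "parent_child_chunks": [
            {
                "parent_content": parent_content,
                "child_contents": child_contents,
                "files": files,
            }
        ],
    }


uploaded = UploadedFile(
    id="file-123", name="chart.png", size=2048, extension="png",
    mime_type="image/png", type="image",
    preview_url="https://example.com/preview/file-123",
)
result = build_parent_child_result(
    parent_mode="paragraph",  # placeholder value
    parent_content="Quarterly revenue overview.",
    child_contents=["Revenue grew in Q3."],
    files=[to_file_entry(uploaded, transfer_method="tool_file")],  # placeholder value
)
print(result)
```

For the `multimodal-General` format, the same `files` entries would sit next to a `content` string in each item of `general_chunks` instead.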

## Declare Multimodal Output Structure

Dify's official JSON schemas define the structure of multimodal data.

To let the Knowledge Base node recognize the plugin's multimodal output type, point the `result` field under `output_schema` in the plugin's provider YAML file to the corresponding official schema URL.

```yaml theme={null}
output_schema:
  type: object
  properties:
    result:
      # multimodal-Parent-Child
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
      
      # multimodal-General
      # $ref: "https://dify.ai/schemas/v1/multimodal_general_structure.json"
```

For example, a complete YAML configuration using `multimodal-Parent-Child` looks like this:

```yaml expandable theme={null}
identity:
  name: multimodal_tool
  author: langgenius
  label:
    en_US: multimodal tool
    zh_Hans: 多模态提取器
    pt_BR: multimodal tool
description:
  human:
    en_US: Process documents into multimodal-Parent-Child chunk structures
    zh_Hans: 将文档处理为多模态父子分块结构
    pt_BR: Processar documentos em estruturas de divisão pai-filho
  llm: Processes documents into hierarchical multimodal-Parent-Child chunk structures

parameters:
  - name: input_text
    human_description:
      en_US: The text you want to chunk.
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    label:
      en_US: Input Content
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    llm_description: The text you want to chunk.
    required: true
    type: string
    form: llm

output_schema:
  type: object
  properties:
    result:
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
extra:
  python:
    source: tools/parent_child_chunk.py
```
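
Before a tool returns `result`, it can be useful to check the payload against the `required` lists in the parent-child schema shown earlier on this page. The sketch below is a hand-written check covering only those required keys. `find_missing_fields` is a hypothetical helper, not a Dify API, and it does not replace full JSON Schema validation.

```python
REQUIRED_FILE_KEYS = {
    "name", "size", "extension", "type",
    "mime_type", "transfer_method", "url", "related_id",
}


def find_missing_fields(result: dict) -> list:
    """Return paths of required keys missing from a parent-child result."""
    problems = []
    for key in ("parent_mode", "parent_child_chunks"):
        if key not in result:
            problems.append(key)
    for i, chunk in enumerate(result.get("parent_child_chunks", [])):
        for key in ("parent_content", "child_contents"):
            if key not in chunk:
                problems.append(f"parent_child_chunks[{i}].{key}")
        # `files` itself is optional, but each entry has required keys.
        for j, file_entry in enumerate(chunk.get("files", [])):
            for key in sorted(REQUIRED_FILE_KEYS - set(file_entry)):
                problems.append(f"parent_child_chunks[{i}].files[{j}].{key}")
    return problems


incomplete = {
    "parent_mode": "paragraph",  # placeholder value
    "parent_child_chunks": [
        {"parent_content": "Intro", "files": [{"name": "a.png"}]},
    ],
}
for path in find_missing_fields(incomplete):
    print("missing:", path)
```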
