# Document Extractor

> Extract text content from uploaded documents for AI processing

The Document Extractor node converts uploaded files into text that LLMs can process. Since language models can't directly read document formats like PDF or DOCX, this node serves as the essential bridge between file uploads and AI analysis.

<Frame caption="Document Extractor Node Configuration">
  ![Document Extractor Node Configuration](https://assets-docs.dify.ai/dify-enterprise-mintlify/en/guides/workflow/node/f3853b40904e275da895711107e9c72f.png)
</Frame>

## Supported File Types

The node handles most text-based document formats:

**Text Documents** - TXT, Markdown, HTML files with direct text content

**Word Documents** - DOCX files are parsed directly, with tables converted to Markdown; legacy DOC files require the Unstructured API

**PDF Documents** - Text-based PDFs, extracted with pypdfium2

**Spreadsheets** - Excel (.xls/.xlsx) and CSV files converted to Markdown tables

**Presentations** - PowerPoint (.ppt/.pptx) files processed via Unstructured API

**Email Formats** - EML and MSG files for email content extraction

**Specialized Formats** - EPUB books, VTT subtitles, JSON/YAML data, and Properties files

The node does not extract text from primarily binary files such as images, audio, or video. These require specialized processing tools or external services.

## Input and Output

### Input Configuration

Configure the node to accept either:

**Single File** input from a file variable (typically from the Start node)

**Multiple Files** as an array for batch document processing

### Output Structure

The node outputs extracted text content:

* Single file input produces a `string` containing the extracted text
* Multiple file input produces an `array[string]` with each file's content

The output variable is named `text` and contains the raw text content ready for downstream processing.
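
A downstream step often needs to handle both output shapes. The following is a minimal sketch, assuming a Python Code node whose `main` function receives the extractor's `text` variable as an input; the `join_extracted_text` helper and the separator are hypothetical and not part of Dify.

```python
from typing import List, Union


def join_extracted_text(text: Union[str, List[str]], separator: str = "\n\n---\n\n") -> str:
    """Normalize the extractor output to a single string.

    Hypothetical helper: accepts either the `string` produced for a single
    file or the `array[string]` produced for multiple files.
    """
    if isinstance(text, str):
        return text
    # Skip empty extractions so the separator never appears twice in a row.
    return separator.join(part for part in text if part)


def main(text: Union[str, List[str]]) -> dict:
    # Assumed Code-node entry point; the input name `text` is whatever
    # variable you mapped the Document Extractor output to.
    return {"result": join_extracted_text(text)}
```

Joining with a visible separator keeps file boundaries recognizable when the combined text is placed in an LLM prompt.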

## Implementation Example

Here's a complete document Q\&A workflow using the Document Extractor:

<Frame caption="ChatPDF-style Workflow Implementation">
  ![ChatPDF-style Workflow Implementation](https://assets-docs.dify.ai/dify-enterprise-mintlify/en/guides/workflow/node/f6ea094b30b240c999a4248d1fc21a1c.png)
</Frame>

### Workflow Setup

**File Upload Configuration** - Enable file input in your Start node to accept document uploads from users.

**Text Extraction** - Connect the Document Extractor to process uploaded files and extract their text content.

**AI Processing** - Use the extracted text in LLM prompts for analysis, summarization, or question answering.

<Frame caption="Document Processing in Action">
  ![Document Processing in Action](https://assets-docs.dify.ai/dify-enterprise-mintlify/en/guides/workflow/node/83bca46bcde07069660ff649e5c7cf4c.png)
</Frame>

<Frame caption="Chat Interface with Document Upload">
  ![Chat Interface with Document Upload](https://assets-docs.dify.ai/dify-enterprise-mintlify/en/guides/workflow/node/d05301438e8aab7393bb5863554f1009.png)
</Frame>

## Processing Considerations

The Document Extractor uses specialized parsing libraries optimized for different file formats. It preserves text structure and formatting where possible, making extracted content more useful for LLM processing.

### File Format Processing

**Encoding Detection** - Uses the chardet library to detect file encoding automatically, falling back to UTF-8 for text-based files

**Table Conversion** - Excel and CSV data becomes Markdown tables for better LLM comprehension

**Document Structure** - DOCX files maintain paragraph and table ordering with proper table-to-Markdown conversion

**Multi-line Content** - VTT subtitle files merge consecutive utterances by the same speaker
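
To illustrate the table conversion described above, here is a minimal sketch that turns CSV text into a Markdown table using only the Python standard library. It is not the node's actual implementation; the `csv_to_markdown` function is hypothetical, and treating the first row as the header is an assumption.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown table (first row assumed to be the header)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    divider = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), divider, *(render(r) for r in body)])
```

Calling `csv_to_markdown("name,score\nAda,95")` yields a three-line table: a header row, a `---` divider row, and one data row.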

### External Dependencies

Some file formats require the **Unstructured API** service, which is configured via `UNSTRUCTURED_API_URL` and `UNSTRUCTURED_API_KEY`:

* DOC files (legacy Word documents)
* PowerPoint presentations (if using API processing)
* EPUB books (if using API processing)
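
Before accepting uploads, a deployment script can check whether the formats listed above are covered by the current configuration. This is a hedged sketch: the helper names are hypothetical, and the extension set assumes API processing is in use for PowerPoint and EPUB files. Only the two environment variable names come from this page.

```python
import os
from pathlib import Path

# Extensions this page lists as depending on the Unstructured API.
# PowerPoint and EPUB are included on the assumption that API processing is used.
UNSTRUCTURED_EXTENSIONS = {".doc", ".ppt", ".pptx", ".epub"}
UNSTRUCTURED_SETTINGS = ["UNSTRUCTURED_API_URL", "UNSTRUCTURED_API_KEY"]


def needs_unstructured(filename: str) -> bool:
    """Return True if the file extension is in the assumed Unstructured set."""
    return Path(filename).suffix.lower() in UNSTRUCTURED_EXTENSIONS


def missing_unstructured_settings(env=None) -> list:
    """Return the names of the settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in UNSTRUCTURED_SETTINGS if not env.get(name)]
```

For example, `needs_unstructured("report.doc")` is true, so an empty result from `missing_unstructured_settings()` is worth confirming before routing that file to the node.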

For very large documents, check the extracted text against the LLM's context limit and split it into chunks if needed. The extracted text keeps the original document's logical structure where possible, which helps preserve meaning and context.
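
One possible chunking strategy is to pack whole paragraphs into chunks up to a size limit. The sketch below assumes paragraphs are separated by blank lines and measures size in characters rather than tokens; `chunk_text` and `max_chars` are hypothetical names, and the right limit depends on your model.

```python
def chunk_text(text: str, max_chars: int = 4000) -> list:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any single paragraph that exceeds the limit on its own.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The paragraph does not fit; close the current chunk and start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the LLM separately, for example inside an Iteration node, and the partial results combined afterwards.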
