Create Knowledge Base & Upload Documents

Steps to upload documents into Knowledge:

  1. Select the document you need to upload from your local files;

  2. Segment and clean the document, and preview the effect;

  3. Choose and configure Index Mode and Retrieval Settings;

  4. Wait for the chunks to be embedded;

  5. Upload complete. You can now use the knowledge base in your applications 🎉

1 Creating a Knowledge Base

Click on Knowledge in the main navigation bar of Dify. On this page, you can see your existing knowledge bases. Click Create Knowledge to enter the setup wizard:

  • Drag and drop or select files to upload. The number of files allowed for batch upload depends on your subscription plan;

  • If you have not prepared any documents yet, you can first create an empty knowledge base;

  • When creating a knowledge base with an external data source (such as Notion or Sync from website), the knowledge base type becomes immutable. This restriction prevents management complexities that could arise from multiple data sources within a single knowledge base.

    For scenarios requiring multiple data sources, we recommend creating separate knowledge bases for each source. You can then utilize the Multiple-Retrieval feature to reference multiple knowledge bases within the same application.

Limitations for uploading documents:

  • The upload size limit for a single document is 15MB;

  • In the SaaS version, your subscription plan limits the number of files per batch upload, the total number of uploaded documents, and vector storage;
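
As a rough illustration of the size limit above, the following sketch checks local files before you try to upload them. The helper name `oversized_files` is hypothetical, and treating 15MB as 15 × 1024 × 1024 bytes is an assumption; the service may count the limit differently.

```python
from pathlib import Path

# 15MB per-document limit, taken from the list above. Counting a megabyte as
# 1024 * 1024 bytes is an assumption of this sketch.
MAX_UPLOAD_BYTES = 15 * 1024 * 1024


def oversized_files(paths, limit=MAX_UPLOAD_BYTES):
    """Return the paths whose size on disk exceeds the limit (hypothetical helper)."""
    return [p for p in map(Path, paths) if p.stat().st_size > limit]
```

Any file returned by this check would need to be split or compressed before upload.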


2 Text Preprocessing and Cleaning

After uploading content to the knowledge base, it needs to undergo chunking and data cleaning. This stage can be understood as content preprocessing and structuring.

What is text chunking and cleaning?

Chunking: LLMs have a limited context window, so a long text usually has to be split into segments, and only the segments most relevant to the user's question are recalled. This is known as the segment TopK recall mode. Appropriately sized segments also make semantic matching against the user's question more precise and reduce information noise.

Cleaning: To ensure the quality of text recall, data usually needs to be cleaned before it is passed to the model. For example, unwanted characters or blank lines may lower the quality of the response. Dify provides several cleaning methods to tidy the text before it is sent to downstream applications. See "Optional ETL Configuration" in the Reference section for details.
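
The two ideas can be sketched in a few lines of Python. This is a simplified illustration, not Dify's implementation: the function names, the character-based chunk size, and the overlap value are all assumptions made for the example.

```python
import re


def clean_text(text):
    """Collapse runs of spaces and tabs, and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def chunk_text(text, max_chars=500, overlap=50):
    """Split text into character chunks of at most max_chars, with overlap.

    Sizes are measured in characters here for simplicity (an assumption);
    real systems often measure tokens and split on delimiters instead.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

Cleaning first and chunking second keeps stray whitespace from consuming part of each chunk's size budget.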

Two strategies are supported:

  • Automatic mode

  • Custom mode

Automatic

Automatic mode is designed for users unfamiliar with segmentation and preprocessing techniques. In this mode, Dify automatically segments and cleans content files, streamlining the document preparation process.


3 Indexing Mode

Choose an indexing method for the text; it determines how data is matched at retrieval time. The indexing strategy is closely tied to the retrieval method, so choose the retrieval settings that suit your scenario.

  • High-Quality Mode

  • Economical Mode

  • Q&A Mode

In High-Quality mode, the system first uses a configurable Embedding model to convert chunk text into numerical vectors. This process facilitates efficient compression and persistent storage of large-scale textual data, while also enhancing the accuracy of LLM-user interactions.

The High-Quality indexing method offers three retrieval settings: vector search, full-text search, and hybrid search. For more details on retrieval settings, please check "Retrieval Settings".


4 Retrieval Settings

In high-quality indexing mode, Dify offers three retrieval settings:

  • Vector Search

  • Full-Text Search

  • Hybrid Search

In the Economical indexing mode, Dify offers a single retrieval setting:

Inverted Index:

An inverted index is an index structure designed for rapid keyword retrieval in documents. Its fundamental principle involves mapping keywords from documents to lists of documents containing those keywords, thereby enhancing search efficiency. For a detailed explanation of the underlying mechanism, please refer to the "Inverted Index".
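
The mapping described above can be illustrated with a minimal sketch. The function names and the whitespace-and-punctuation tokenization are assumptions for the example; this is not the index Dify builds internally.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every keyword in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A lookup only touches the lists for the query's keywords, which is why it is faster than scanning every document.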

TopK:

This parameter selects the text chunks that are most similar to the user's question. The system dynamically adjusts the number of chunks based on the context window size of the selected model. The default value is 3; a higher value results in more text segments being retrieved.
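
As a hedged sketch of what TopK selection means, the snippet below keeps the k highest-scoring chunks from a list of (chunk, score) pairs. The function name and the input shape are assumptions; only the default of 3 comes from the text above.

```python
import heapq


def top_k(scored_chunks, k=3):
    """Return the k (chunk, score) pairs with the highest similarity score."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])
```

Raising k passes more context to the model at the cost of more tokens and possibly more noise.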


Reference

Optional ETL Configuration

In production-level applications of RAG, to achieve better data recall, multi-source data needs to be preprocessed and cleaned, i.e., ETL (extract, transform, load). To enhance the preprocessing capabilities of unstructured/semi-structured data, Dify supports optional ETL solutions: Dify ETL and Unstructured ETL.

Unstructured can efficiently extract and transform your data into clean data for subsequent steps.

ETL solution choices in different versions of Dify:

  • The SaaS version defaults to using Unstructured ETL and cannot be changed;

  • The community version defaults to using Dify ETL but can enable Unstructured ETL through environment variables;

Differences in supported file formats for parsing:

  • Dify ETL: txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv;

  • Unstructured ETL: txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv, eml, msg, pptx, ppt, xml, epub;

Different ETL solutions may produce different file extraction results. For more information on Unstructured ETL’s data processing methods, please refer to the official documentation.
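
The format lists above can be turned into a quick lookup. This is only an illustrative sketch: the extension sets are copied from this page, while the function name `supporting_etls` and the decision to match on the file extension alone are assumptions.

```python
# Extension sets copied from the format lists above.
DIFY_ETL = {"txt", "markdown", "md", "pdf", "html", "htm", "xlsx", "xls", "docx", "csv"}
UNSTRUCTURED_ETL = DIFY_ETL | {"eml", "msg", "pptx", "ppt", "xml", "epub"}


def supporting_etls(filename):
    """Return the names of the ETL solutions that list this file's extension."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    names = []
    if ext in DIFY_ETL:
        names.append("Dify ETL")
    if ext in UNSTRUCTURED_ETL:
        names.append("Unstructured ETL")
    return names
```

For example, a `.pptx` file is listed only under Unstructured ETL, so it would need that option enabled.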

Embedding Model

Embedding transforms discrete variables (words, sentences, documents) into continuous vector representations, mapping high-dimensional data to lower-dimensional spaces. This technique preserves crucial semantic information while reducing dimensionality, enhancing content retrieval efficiency.

Embedding models are specialized large language models that excel at converting text into dense numerical vectors, capturing semantic nuances for improved data processing and analysis.
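
To make the idea of comparing texts as vectors concrete, here is a toy sketch. It deliberately swaps the learned embedding model for a simple word-count vector over a fixed vocabulary, so it captures word overlap rather than real semantics; the function names and the vocabulary are assumptions for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text, vocabulary):
    """Stand-in for an embedding model: count vocabulary words in the text."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With a real embedding model the vectors are dense and learned, but the similarity comparison between a question vector and chunk vectors works the same way.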
