Create Knowledge Base & Upload Documents

Steps to upload documents into Knowledge:

  1. Select the document you need to upload from your local files;

  2. Segment and clean the document, and preview the effect;

  3. Choose and configure Index Mode and Retrieval Settings;

  4. Wait for the chunks to be embedded;

  5. Upload complete. You can now use the knowledge base in your applications 🎉

1 Creating a Knowledge Base

Click on Knowledge in the main navigation bar of Dify. On this page, you can see your existing knowledge bases. Click Create Knowledge to enter the setup wizard:

  • Drag and drop or select files to upload. The number of files allowed for batch upload depends on your subscription plan;

  • If you have not prepared any documents yet, you can first create an empty knowledge base;

  • When creating a knowledge base with an external data source (such as Notion or Sync from website), the knowledge base type becomes immutable. This restriction prevents management complexities that could arise from multiple data sources within a single knowledge base.

    For scenarios requiring multiple data sources, we recommend creating separate knowledge bases for each source. You can then utilize the Multiple-Retrieval feature to reference multiple knowledge bases within the same application.

Limitations for uploading documents:

  • The upload size limit for a single document is 15MB;

  • Different subscription plans for the SaaS version limit batch upload numbers, total document uploads, and vector storage;


2 Text Preprocessing and Cleaning

After uploading content to the knowledge base, it needs to undergo chunking and data cleaning. This stage can be understood as content preprocessing and structuring.

What is text chunking and cleaning?

Chunking: LLMs have a limited context window, so the full text usually has to be split into segments, and only the segments most relevant to the user's question are recalled. This is known as the segment TopK recall mode. Appropriate segment sizes also help match the most relevant content and reduce noise when user questions are semantically matched against text segments.
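To make the idea of segmentation concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not Dify's segmentation logic: the function name `chunk_text` and the `max_chars` and `overlap` parameters are assumptions made for this example, and it counts characters rather than tokens.

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    (Hypothetical helper; parameter names are not Dify settings.)
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


sample_chunks = chunk_text("abcdefghij", max_chars=4, overlap=1)
print(sample_chunks)
```

Smaller chunks tend to match a question more precisely, while larger chunks carry more surrounding context; the overlap is one common way to soften the effect of hard boundaries.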

Cleaning: To ensure the quality of text recall, it is usually necessary to clean the data before passing it to the model. For example, unwanted characters or blank lines may degrade the quality of the response. Dify provides several cleaning methods that clean the text before it is sent to downstream applications. See "Optional ETL Configuration" in the Reference section for more details.
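As a rough illustration of what cleaning can involve, the sketch below removes control characters, collapses repeated spaces, and squeezes runs of blank lines. The rules and the name `clean_text` are assumptions chosen for this example; they are not the exact cleaning rules Dify applies.

```python
import re


def clean_text(text: str) -> str:
    """Apply a few simple cleaning rules to raw extracted text.

    (Hypothetical helper; the rule set is illustrative only.)
    """
    # Drop ASCII control characters, keeping tab, newline and carriage return.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    lines = [line.strip() for line in text.split("\n")]
    # Reduce three or more consecutive newlines to one blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip()


print(clean_text("Hello\x00   world\n\n\n\n  next\t line  \n"))
```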

Two strategies are supported:

  • Automatic mode

  • Custom mode

Automatic

Automatic mode is designed for users unfamiliar with segmentation and preprocessing techniques. In this mode, Dify automatically segments and cleans content files, streamlining the document preparation process.


3 Indexing Mode

Choose an indexing method for the text; it determines how data is matched. The indexing strategy is usually tied to the retrieval method, so choose the retrieval settings that fit your scenario.

  • High-Quality Mode

  • Economical Mode

  • Q&A Mode

In High-Quality mode, the system first uses a configurable (switchable) Embedding model to convert chunk text into numerical vectors. This process facilitates efficient compression and persistent storage of large-scale textual data, while also improving the accuracy of LLM-user interactions.

The High-Quality indexing method offers three retrieval settings: Vector Search, Full-Text Search, and Hybrid Search. For more details, see "Retrieval Settings".


4 Retrieval Settings

In high-quality indexing mode, Dify offers three retrieval settings:

  • Vector Search

  • Full-Text Search

  • Hybrid Search

Vector Search

Definition: The system vectorizes the user's input query to generate a query vector. It then computes the distance between this query vector and the text vectors in the knowledge base to identify the most semantically proximate text chunks.
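The definition above can be sketched in a few lines. This example assumes cosine similarity as the measure of closeness and uses tiny hand-written vectors in place of real embeddings; the source does not say which metric Dify uses, and the names `cosine_similarity` and `vector_search` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything.
    return dot / (norm_a * norm_b)


def vector_search(query_vec: list[float],
                  chunk_vecs: dict[str, list[float]],
                  top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k (chunk_id, score) pairs, most similar first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings"; real ones come from an Embedding model.
chunk_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = vector_search([1.0, 0.0], chunk_vectors, top_k=2)
print(results)
```

In practice a vector database performs this comparison with an approximate index rather than a full scan, but the ranking idea is the same.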

Vector Search Settings:

Rerank Model: After configuring the API key for the Rerank model on the "Model Provider" page, you can enable "Rerank Model" in the retrieval settings. The system then semantically reorders the retrieved document results to optimize their ranking. Once a Rerank model is enabled, the TopK and Score Threshold settings only take effect during the reranking step.

TopK: This parameter limits the results to the text chunks most similar to the user's question. The system dynamically adjusts the number of chunks based on the context window size of the selected model. The default value is 3; a higher value retrieves more text segments.

Score Threshold: This parameter sets the similarity threshold for filtering text chunks. Only text chunks whose score exceeds the threshold are recalled. By default, this setting is off, so recalled text chunks are not filtered by similarity. When enabled, the default value is 0.5. A higher value is likely to yield fewer recalled chunks.

The TopK and Score Threshold settings are only effective during the Rerank phase. To apply either of them, you must add and enable a Rerank model.
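The combined effect of TopK and Score Threshold can be sketched as a filter over already-scored chunks. Where the scores come from (vector similarity or a Rerank model) is outside this sketch, and the function name `select_chunks` is a hypothetical helper, not a Dify API. The defaults mirror the values stated above: TopK of 3 and the threshold switched off.

```python
from typing import Optional


def select_chunks(scored: list[tuple[str, float]],
                  top_k: int = 3,
                  score_threshold: Optional[float] = None) -> list[tuple[str, float]]:
    """Keep the best-scoring chunks.

    scored: (chunk_id, score) pairs in any order.
    score_threshold: None means the filter is off; otherwise only chunks
    whose score strictly exceeds the threshold are kept.
    """
    if score_threshold is not None:
        scored = [(cid, s) for cid, s in scored if s > score_threshold]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


candidates = [("a", 0.9), ("b", 0.5), ("c", 0.7), ("d", 0.2)]
print(select_chunks(candidates))                       # threshold off
print(select_chunks(candidates, score_threshold=0.5))  # threshold on
```

With the threshold off, the three best chunks are returned regardless of score; with a threshold of 0.5, the chunk scoring exactly 0.5 is dropped because it does not exceed the threshold.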

In the Economical indexing mode, Dify offers a single retrieval setting:

Inverted Index:

An inverted index is an index structure designed for fast keyword retrieval in documents. It maps each keyword to the list of documents that contain it, so a search only needs to look up the query's keywords instead of scanning every document. For a detailed explanation of the underlying mechanism, refer to "Inverted Index".
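The keyword-to-documents mapping described above can be sketched as follows. This is a minimal illustration, not Dify's implementation: the tokenizer is a simple lowercase split on letters and digits, ranking is by the number of distinct query keywords matched, and the names `tokenize`, `build_inverted_index`, and `keyword_search` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each keyword to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)


def keyword_search(index: dict[str, set[str]], query: str, top_k: int = 3) -> list[str]:
    """Rank documents by how many distinct query keywords they contain."""
    hits: dict[str, int] = defaultdict(int)
    for token in set(tokenize(query)):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    # Sort by match count (descending), then by id for a stable order.
    ranked = sorted(hits.items(), key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


documents = {
    "d1": "Dify supports hybrid search",
    "d2": "Vector search uses embeddings",
    "d3": "Upload documents to Dify",
}
inverted = build_inverted_index(documents)
print(keyword_search(inverted, "dify search"))
```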

TopK:

This parameter limits the results to the text chunks most similar to the user's question. The system dynamically adjusts the number of chunks based on the context window size of the selected model. The default value is 3; a higher value retrieves more text segments.


Reference

Optional ETL Configuration

In production-level applications of RAG, to achieve better data recall, multi-source data needs to be preprocessed and cleaned, i.e., ETL (extract, transform, load). To enhance the preprocessing capabilities of unstructured/semi-structured data, Dify supports optional ETL solutions: Dify ETL and Unstructured ETL.

Unstructured can efficiently extract and transform your data into clean data for subsequent steps.

ETL solution choices in different versions of Dify:

  • The SaaS version defaults to using Unstructured ETL and cannot be changed;

  • The community version defaults to using Dify ETL but can enable Unstructured ETL through environment variables;

Differences in supported file formats for parsing:

Dify ETL: txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv

Unstructured ETL: txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv, eml, msg, pptx, ppt, xml, epub

Different ETL solutions may produce different file extraction results. For more information on Unstructured ETL's data processing methods, please refer to the official documentation.
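The format lists above can be turned into a small lookup that reports which ETL solution can parse a given file. The extension sets are copied from the lists in this section; the helper name `supported_etl` and the idea of checking by file extension are assumptions made for this sketch, not part of Dify.

```python
from pathlib import Path

# Extensions taken from the format lists in this section.
DIFY_ETL_FORMATS = {
    "txt", "markdown", "md", "pdf", "html", "htm", "xlsx", "xls", "docx", "csv",
}
UNSTRUCTURED_ETL_FORMATS = DIFY_ETL_FORMATS | {
    "eml", "msg", "pptx", "ppt", "xml", "epub",
}


def supported_etl(filename: str) -> list[str]:
    """Return the ETL solutions whose format list includes the file's extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    solutions = []
    if ext in DIFY_ETL_FORMATS:
        solutions.append("Dify ETL")
    if ext in UNSTRUCTURED_ETL_FORMATS:
        solutions.append("Unstructured ETL")
    return solutions


for name in ("report.PDF", "slides.pptx", "image.png"):
    print(name, supported_etl(name))
```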

Embedding Model

Embedding transforms discrete variables (words, sentences, documents) into continuous vector representations, mapping high-dimensional data to lower-dimensional spaces. This technique preserves crucial semantic information while reducing dimensionality, enhancing content retrieval efficiency.

Embedding models, specialized large language models, excel at converting text into dense numerical vectors, effectively capturing semantic nuances for improved data processing and analysis.
