Last updated
Last updated
After uploading content to the knowledge base, the next step is chunking and data cleaning. This stage involves content preprocessing and structuring, where long texts are divided into multiple smaller chunks.
Whether an LLM can accurately answer knowledge base queries depends on how effectively the system retrieves relevant content chunks. High-relevance chunks are crucial for AI applications to produce precise and comprehensive responses.
In an AI customer chatbot scenario, for example, directing the LLM to the key content chunks in a tool manual is sufficient to quickly answer user questions—no need to repeatedly analyze the entire document. This approach saves tokens during the analysis phase while boosting the overall quality of the AI-generated answers.
The knowledge base supports two chunking modes: General Mode and Parent-child Mode. If you are creating a knowledge base for the first time, it is recommended to choose Parent-Child Mode.
Please note: The original “Automatic Chunking and Cleaning” mode has been automatically updated to “General” mode. No changes are required, and you can continue to use the default setting.
Once the chunk mode is selected and the knowledge base is created, it cannot be changed later. Any new documents added to the knowledge base will follow the same chunking strategy.
Content will be divided into independent chunks. When a user submits a query, the system automatically calculates the relevance between the chunks and the query keywords. The top-ranked chunks are then retrieved and sent to the LLM for processing the answers.
In this mode, you need to manually define text chunking rules based on different document formats or specific scenario requirements. Refer to the following configuration options for guidance:
Maximum chunk length: Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking. The default value is 500 tokens, with a maximum chunk length of 4000 tokens.
Overlapping chunk length: When data is chunked, there is a certain amount of overlap between chunks. This overlap can help to improve the retention of information and the accuracy of analysis, and enhance retrieval effects. It is recommended that the setting be 10-25% of the chunk length Tokens.
Text Preprocessing Rules, Text preprocessing rules help filter out irrelevant content from the knowledge base. The following options are available:
Replace consecutive spaces, newline characters, and tabs
Remove all URLs and email addresses
Once configured, click “Preview Chunk” to see the chunking results. You can view the character count for each chunk. If you modify the chunking rules, click the button again to view the latest generated text chunks.
If multiple documents are uploaded in bulk, you can switch between them by clicking the document titles at the top to review the chunk results for other documents.
Compared to General mode, Parent-child mode uses a two-tier data structure that balances precise retrieval with comprehensive context, combining accurate matching and richer contextual information.
In this mode, parent chunks (e.g., paragraphs) serve as larger text units to provide context, while child chunks (e.g., sentences) focus on pinpoint retrieval. The system searches child chunks first to ensure relevance, then fetches the corresponding parent chunk to supply the full context—thereby guaranteeing both accuracy and a complete background in the final response. You can customize how parent and child chunks are split by configuring delimiters and maximum chunk lengths.
For example, in an AI-powered customer chatbot case, a user query can be mapped to a specific sentence within a support document. The paragraph or chapter containing that sentence is then provided to the LLM, filling in the overall context so the answer is more precise.
Its fundamental mechanism includes:
Query Matching with Child Chunks:
Small, focused pieces of information, often as concise as a single sentence within a paragraph, are used to match the user's query.
These child chunks enable precise and relevant initial retrieval.
Contextual Enrichment with Parent Chunks:
Larger, encompassing sections—such as a paragraph, a section, or even an entire document—that include the matched child chunks are then retrieved.
These parent chunks provide comprehensive context for the Language Model (LLM).
In this mode, you need to manually configure separate chunking rules for both parent and child chunks based on different document formats or specific scenario requirements.
Parent Chunk
The parent chunk settings offer the following options:
Paragraph
This mode splits the text in to paragraphs based on delimiters and the maximum chunk length, using the split text as the parent chunk for retrieval. Each paragraph is treated as a parent chunk, suitable for documents with large volumes of text, clear content, and relatively independent paragraphs. The following settings are supported:
Maximum chunk length: Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking. The default value is 500 tokens, with a maximum chunk length of 4000 tokens.
Full Doc
Instead of splitting the text into paragraphs, the entire document is used as the parent chunk and retrieved directly. For performance reasons, only the first 10,000 tokens of the text are retained. This setting is ideal for smaller documents where paragraphs are interrelated, requiring full doc retrieval.
Child Chunk
Child chunks are derived from parent chunks by splitting them based on delimiter rules. They are used to identify and match the most relevant and direct information to the query keywords. When using the default child chunking rules, the segmentation typically results in the following:
If the parent chunk is a paragraph, child chunks correspond to individual sentences within each paragraph.
If the parent chunk is the full document, child chunks correspond to the individual sentences within the document.
You can configure the following chunk settings:
Maximum chunk length: Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking. The default value is 200 tokens, with a maximum chunk length of 4000 tokens.
You can also use text preprocessing rules to filter out irrelevant content from the knowledge base:
Replace consecutive spaces, newline characters, and tabs
Remove all URLs and email addresses
After completing the configuration, click “Preview Chunks” to view the results. You can see the total character count of the parent chunk.
Once configured, click “Preview Chunk” to see the chunking results. You can see the total character count of the parent chunk. Characters highlighted in blue represent child chunks, and the character count for the current child chunk is also displayed for reference.
If you modify the chunking rules, click the button again to view the latest generated text chunks.
If multiple documents are uploaded in bulk, you can switch between them by clicking the document titles at the top to review the chunk results for other documents.
The difference between the two modes lies in the structure of the content chunks. General Mode produces multiple independent content chunks, whereas Parent-child Mode uses a two-layer chunking approach. In this way, a single parent chunk (e.g., the entire document or a paragraph) contains multiple child chunks (e.g., sentences).
Different chunking methods influence how effectively the LLM can search the knowledge base. When used on the same document, Parent-child Retrieval provides more comprehensive context while maintaining high precision, making it significantly more effective than the traditional single-layer approach.
After choosing the chunking mode, refer to the following documentation to configure the indexing method and retrieval method and finis the creation of your knowledge base.
Chunk identifier, The default value is \n\n
, which means the text will be chunked by paragraphs. You can customize chunking rules using . The system will automatically execute chunking whenever it detects the specified delimiter. For example, \n
means chunk the text by sentences.
After setting the chunking rules, the next step is to specify the indexing method. General mode supports High-Quality Indexing Method and Economical Indexing Method. For more details, please refer to .
Chunk Delimiter: The default value is \n\n
, which chunks text by paragraphs. You can customize the chunking rules using . The system automatically chunks the text whenever the specified delimiter appears.
Chunk Delimiter: The default value is \n
, which chunks text by sentences. You can customize the chunking rules using . The system automatically chunks the text whenever the specified delimiter appears.
To ensure accurate content retrieval, the Parent-child chunk mode only supports the .