What is the Chunking and Cleaning Strategy?
- Chunking
- Cleaning
Chunk Mode
The knowledge base supports two chunking modes: General Mode and Parent-child Mode. If you are creating a knowledge base for the first time, it is recommended to choose Parent-child Mode.
Please note: The original “Automatic Chunking and Cleaning” mode has been automatically updated to “General” mode. No changes are required, and you can continue to use the default setting.
Once the chunk mode is selected and the knowledge base is created, it cannot be changed later. Any new documents added to the knowledge base will follow the same chunking strategy.
General Mode
Content is divided into independent chunks. When a user submits a query, the system automatically calculates the relevance between the chunks and the query keywords. The top-ranked chunks are then retrieved and sent to the LLM to generate the answer. In this mode, you need to manually define text chunking rules based on different document formats or specific scenario requirements. Refer to the following configuration options for guidance:
- Chunk identifier: The system chunks the text whenever it detects the specified delimiter. The default value is \n\n, which means the text is chunked by paragraphs.
- Maximum chunk length: Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system automatically forces a split.
- Overlapping chunk length: The amount of text shared between adjacent chunks. Overlap helps preserve information that straddles a chunk boundary and improves retrieval results. The recommended setting is 10-25% of the chunk length (in tokens).
- Replace consecutive spaces, newline characters, and tabs
- Remove all URLs and email addresses
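As a rough illustration of how the options above interact, the following Python sketch splits text on a delimiter, forces a split when a piece exceeds the maximum length, and carries an overlap between forced splits. It is not the product's implementation: the function names, the regular expressions, and the choice to count characters (rather than tokens) for both length and overlap are assumptions made for this example.

```python
import re


def clean_text(text: str, remove_urls_emails: bool = False) -> str:
    """Hypothetical version of the two cleaning rules."""
    if remove_urls_emails:
        # Simplified patterns, assumed for illustration only
        text = re.sub(r"https?://\S+", "", text)
        text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "", text)
    # Collapse runs of spaces/tabs, and runs of 3+ newlines into one blank line
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def chunk_general(text: str, delimiter: str = "\n\n",
                  max_length: int = 500, overlap: int = 50) -> list[str]:
    """Split on the delimiter; force-split oversized pieces with overlap."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be >= 0 and smaller than max_length")
    step = max_length - overlap
    chunks = []
    for piece in text.split(delimiter):
        piece = piece.strip()
        if not piece:
            continue
        if len(piece) <= max_length:
            chunks.append(piece)
            continue
        # Piece is too long: slide a window of max_length, advancing by step
        for start in range(0, len(piece), step):
            chunks.append(piece[start:start + max_length])
            if start + max_length >= len(piece):
                break
    return chunks


if __name__ == "__main__":
    sample = clean_text("First paragraph.\n\n\n\nSecond   paragraph, a bit longer.")
    for chunk in chunk_general(sample, max_length=20, overlap=4):
        print(repr(chunk))
```

With an overlap of 4 and a maximum length of 20, the overlap is 20% of the chunk length, which falls inside the recommended 10-25% range.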
After setting the chunking rules, the next step is to specify the indexing method. General mode supports both the High-Quality and the Economical indexing methods. For more details, please refer to Set up the Indexing Method.
Parent-child Mode
Compared to General mode, Parent-child mode uses a two-tier data structure that balances precise retrieval with comprehensive context. In this mode, parent chunks (e.g., paragraphs) serve as larger text units that provide context, while child chunks (e.g., sentences) are used for pinpoint retrieval. The system searches the child chunks first to ensure relevance, then fetches the corresponding parent chunk to supply the full context, so the final response is both accurate and well grounded. You can customize how parent and child chunks are split by configuring delimiters and maximum chunk lengths.
For example, in an AI-powered customer service chatbot, a user query can be mapped to a specific sentence within a support document. The paragraph or chapter containing that sentence is then provided to the LLM, filling in the overall context so the answer is more precise.
Its fundamental mechanism includes:
- Query Matching with Child Chunks:
- Small, focused pieces of information, often as concise as a single sentence within a paragraph, are used to match the user’s query.
- These child chunks enable precise and relevant initial retrieval.
- Contextual Enrichment with Parent Chunks:
- Larger, encompassing sections—such as a paragraph, a section, or even an entire document—that include the matched child chunks are then retrieved.
- These parent chunks provide comprehensive context for the Language Model (LLM).
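The two-step mechanism above can be sketched in a few lines of Python. This is a toy model rather than the actual retrieval pipeline: the `score` function counts shared words as a stand-in for whatever relevance measure the system uses, and `retrieve_parents`, `top_k`, and the `(child_text, parent_index)` layout are names assumed for this example.

```python
def score(query: str, text: str) -> int:
    """Toy relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query: str, parents: list[str],
                     children: list[tuple[str, int]], top_k: int = 1) -> list[str]:
    """Match the query against child chunks, then return their parent chunks.

    `children` holds (child_text, parent_index) pairs.
    """
    # Step 1: rank the small child chunks against the query
    ranked = sorted(children, key=lambda c: score(query, c[0]), reverse=True)
    # Step 2: map the best children back to their parents, without duplicates
    parent_ids = []
    for _child_text, parent_idx in ranked[:top_k]:
        if parent_idx not in parent_ids:
            parent_ids.append(parent_idx)
    return [parents[i] for i in parent_ids]


if __name__ == "__main__":
    parents = [
        "Refunds take five days.\nContact support to start a refund.",
        "Shipping is free.\nOrders ship in two days.",
    ]
    children = [
        ("Refunds take five days.", 0),
        ("Contact support to start a refund.", 0),
        ("Shipping is free.", 1),
        ("Orders ship in two days.", 1),
    ]
    # The query matches one sentence, but the whole paragraph is returned
    print(retrieve_parents("when do orders ship", parents, children))
```

The point of the design is that matching happens on short, focused text, while the LLM receives the larger unit that surrounds the match.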
In this mode, you need to manually configure separate chunking rules for both parent and child chunks based on different document formats or specific scenario requirements.
Parent Chunk
The parent chunk settings offer the following options:
- Paragraph: This mode splits the text into paragraphs based on delimiters and the maximum chunk length, using the split text as the parent chunk for retrieval. Each paragraph is treated as a parent chunk, which suits documents with large volumes of text, clear content, and relatively independent paragraphs. The following settings are supported:
  - Chunk Delimiter: The system automatically chunks the text whenever the specified delimiter appears. The default value is \n\n, which chunks text by paragraphs.
  - Maximum chunk length: Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking.
- Full Doc: Instead of splitting the text into paragraphs, the entire document is used as the parent chunk and retrieved directly. For performance reasons, only the first 10,000 tokens of the text are retained. This setting is ideal for smaller documents where paragraphs are interrelated, requiring full doc retrieval.
Child Chunk
Child chunks are derived from parent chunks by splitting them based on delimiter rules. They are used to identify and match the most relevant and direct information to the query keywords. When using the default child chunking rules, the segmentation typically results in the following:
- If the parent chunk is a paragraph, child chunks correspond to individual sentences within each paragraph.
- If the parent chunk is the full document, child chunks correspond to the individual sentences within the document.
The child chunk settings offer the following options:
- Chunk Delimiter: The system automatically chunks the text whenever the specified delimiter appears. The default value is \n, which chunks text by sentences.
- Maximum chunk length: Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking.
- Replace consecutive spaces, newline characters, and tabs
- Remove all URLs and email addresses
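To make the relationship between the parent and child settings concrete, here is a minimal Python sketch that builds both layers from one document. The helper names, the `parent_mode` values, and the numeric limits are assumptions chosen for illustration and are not the product's defaults. The sketch counts characters and omits the 10,000-token cap that applies in Full Doc mode, since that depends on the tokenizer in use.

```python
def split_with_limit(text: str, delimiter: str, max_length: int) -> list[str]:
    """Split on the delimiter, then hard-split any piece over max_length."""
    pieces = []
    for part in text.split(delimiter):
        part = part.strip()
        if not part:
            continue
        for start in range(0, len(part), max_length):
            pieces.append(part[start:start + max_length])
    return pieces


def build_parent_child(text: str, parent_mode: str = "paragraph",
                       parent_delimiter: str = "\n\n", parent_max: int = 500,
                       child_delimiter: str = "\n", child_max: int = 200):
    """Return (parents, children); children are (child_text, parent_index)."""
    if parent_mode == "full_doc":
        # The whole document is the single parent chunk
        parents = [text]
    else:
        parents = split_with_limit(text, parent_delimiter, parent_max)
    children = []
    for idx, parent in enumerate(parents):
        # Child chunks are always derived from their own parent chunk
        for child in split_with_limit(parent, child_delimiter, child_max):
            children.append((child, idx))
    return parents, children


if __name__ == "__main__":
    doc = "A one.\nA two.\n\nB one."
    parents, children = build_parent_child(doc)
    print(parents)
    print(children)
```

Each child keeps the index of the parent it came from, which is what allows a matched child chunk to be expanded back into its surrounding context at retrieval time.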
To ensure accurate content retrieval, Parent-child mode supports only the High-Quality indexing method.
What’s the Difference Between the Two Modes?
The difference between the two modes lies in the structure of the content chunks. General Mode produces multiple independent content chunks, whereas Parent-child Mode uses a two-layer chunking approach. In this way, a single parent chunk (e.g., the entire document or a paragraph) contains multiple child chunks (e.g., sentences). Different chunking methods influence how effectively the LLM can search the knowledge base. When used on the same document, Parent-child Retrieval provides more comprehensive context while maintaining high precision, making it significantly more effective than the traditional single-layer approach.
Reference
After choosing the chunking mode, refer to the following documentation to configure the indexing method and retrieval method and finish creating your knowledge base.
Select the Indexing Method