Semantic Chunking

The platform uses semantic-aware chunking to split documents intelligently, preserving context and structure for better retrieval.

Why Semantic Chunking?

Traditional character-based chunking has problems:

| Issue | Character-Based | Semantic Chunking |
| --- | --- | --- |
| Context loss | Splits mid-sentence | Respects boundaries |
| Structure ignored | Lists split randomly | Lists stay together |
| No hierarchy | Flat chunks | Heading paths preserved |
| Size variance | Inconsistent | 200-400 token target |

Chunking Pipeline

Document Input (PDF / DOCX / XLSX / URL)
          │
          ▼
┌───────────────────┐
│ Block Flattening  │ ◀─── Preserve parent context
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Semantic Grouping │ ◀─── Group by block type
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Token Merging     │ ◀─── Merge small groups
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Quality Filter    │ ◀─── Remove junk chunks
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Overlap Injection │ ◀─── Add context bridges
└─────────┬─────────┘
          │
          ▼
LangChain Documents

Phase 1: Block Flattening

Documents are parsed and flattened into structured blocks while preserving parent context:

// services/fileIngestionService.js

const flattenBlocks = (blocks, parentPath = []) => {
  const flattened = [];

  for (const block of blocks) {
    const blockWithContext = {
      ...block,
      _parentPath: [...parentPath],
    };

    flattened.push(blockWithContext);

    // Track heading hierarchy
    if (block.has_children && block.children) {
      const childPath = isHeading(block)
        ? [...parentPath, getBlockText(block)]
        : parentPath;
      flattened.push(...flattenBlocks(block.children, childPath));
    }
  }

  return flattened;
};

Example:

Input:
# Finance (heading_1)
## Invoices (heading_2)
- Item 1 (bulleted_list_item)
- Item 2 (bulleted_list_item)

Output:
[
  { type: 'heading_1', text: 'Finance', _parentPath: [] },
  { type: 'heading_2', text: 'Invoices', _parentPath: ['Finance'] },
  { type: 'bulleted_list_item', text: 'Item 1', _parentPath: ['Finance', 'Invoices'] },
  { type: 'bulleted_list_item', text: 'Item 2', _parentPath: ['Finance', 'Invoices'] },
]
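
The flattening code references isHeading and getBlockText without showing them. A minimal sketch of what these helpers might look like (the block shape and the Notion-style rich-text fallback are assumptions; the real implementations may differ):

// Hypothetical sketches of the helpers used by flattenBlocks.

// Treat heading_1..heading_3 as headings.
const isHeading = (block) =>
  ['heading_1', 'heading_2', 'heading_3'].includes(block.type);

// Extract plain text from a block: prefer a `text` field (as in the example
// above), otherwise fall back to Notion-style rich text.
const getBlockText = (block) => {
  if (typeof block.text === 'string') return block.text;
  const richText = block[block.type]?.rich_text || [];
  return richText.map((t) => t.plain_text || '').join('');
};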

Phase 2: Semantic Grouping

Blocks are grouped based on semantic category:

Block Categories

const getBlockCategory = (block) => {
  const type = block.type;

  if (isHeading(block)) return 'heading';
  if (['bulleted_list_item', 'numbered_list_item', 'to_do'].includes(type)) return 'list';
  if (type === 'toggle') return 'toggle';
  if (type === 'table_row') return 'table';
  if (type === 'callout') return 'callout';
  if (type === 'code') return 'code';
  if (type === 'quote') return 'quote';

  return 'paragraph';
};

Grouping Rules

| Category | Rule | Max Size |
| --- | --- | --- |
| heading_group | Heading + following paragraphs | 400 tokens |
| list | Consecutive list items (same type) | 15 items OR 400 tokens |
| table | Consecutive table rows | 400 tokens |
| code | Always standalone | Unlimited |
| callout | Always standalone | Unlimited |
| toggle | Toggle + children, standalone | Unlimited |
| paragraph_group | Consecutive paragraphs | 400 tokens (80% flush) |

Grouping Algorithm

const MAX_GROUP_TOKENS = 400;
const MAX_LIST_ITEMS = 15;
const FLUSH_THRESHOLD = Math.floor(MAX_GROUP_TOKENS * 0.8); // 320 tokens

export const groupBlocksSemantically = (blocks) => {
  const flatBlocks = flattenBlocks(blocks);
  const groups = [];
  let currentGroup = { blocks: [], category: null, headingPath: [] };

  for (const block of flatBlocks) {
    const category = getBlockCategory(block);
    const currentTokens = estimateTokens(transformBlocksToText(currentGroup.blocks));

    // Headings start new groups
    if (isHeading(block)) {
      flushGroup();
      currentGroup.category = 'heading_group';
      addBlockToGroup(block);
      continue;
    }

    // Lists: keep together up to limit
    if (category === 'list') {
      const listItemCount = currentGroup.blocks.filter(b =>
        ['bulleted_list_item', 'numbered_list_item', 'to_do'].includes(b.type)
      ).length;

      if (currentGroup.category === 'list' &&
          currentTokens < MAX_GROUP_TOKENS &&
          listItemCount < MAX_LIST_ITEMS) {
        addBlockToGroup(block);
      } else {
        flushGroup();
        currentGroup.category = 'list';
        addBlockToGroup(block);
      }
      continue;
    }

    // Code/callout: always standalone
    if (category === 'code' || category === 'callout') {
      flushGroup();
      currentGroup.category = category;
      addBlockToGroup(block);
      flushGroup();
      continue;
    }

    // Paragraphs: flush at 80% capacity
    if (currentTokens >= FLUSH_THRESHOLD) {
      flushGroup();
    }
    addBlockToGroup(block);
  }

  flushGroup();
  return mergeSmallGroups(groups);
};
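
The algorithm above calls two closures, addBlockToGroup and flushGroup, that are elided. A minimal sketch of how they could be defined inside groupBlocksSemantically (the exact bookkeeping is an assumption; transformBlocksToText and estimateTokens are the same helpers used above):

// Sketch of the elided closures (assumed to live inside groupBlocksSemantically).

// Append a block to the group in progress and keep its heading path current.
const addBlockToGroup = (block) => {
  currentGroup.blocks.push(block);
  currentGroup.headingPath = block._parentPath || currentGroup.headingPath;
};

// Emit the group in progress (if non-empty) and start a fresh one.
const flushGroup = () => {
  if (currentGroup.blocks.length > 0) {
    const content = transformBlocksToText(currentGroup.blocks);
    groups.push({ ...currentGroup, content, tokens: estimateTokens(content) });
  }
  currentGroup = { blocks: [], category: null, headingPath: [] };
};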

Phase 3: Small Group Merging

Groups under 200 tokens are merged with predecessors sharing the same heading path:

const MIN_GROUP_TOKENS = 200;

export const mergeSmallGroups = (groups) => {
  const merged = [];

  for (const group of groups) {
    // Skip merging for code and lists (intentional splits)
    if (group.tokens >= MIN_GROUP_TOKENS ||
        group.category === 'code' ||
        group.category === 'list') {
      merged.push(group);
      continue;
    }

    // Find predecessor with same heading path
    let target = null;
    for (let j = merged.length - 1; j >= 0; j--) {
      if (arraysEqual(merged[j].headingPath, group.headingPath)) {
        target = merged[j];
        break;
      }
    }

    if (target) {
      // Merge into target
      target.content = target.content + '\n\n' + group.content;
      target.tokens = estimateTokens(target.content);
    } else {
      merged.push(group);
    }
  }

  return merged;
};
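
arraysEqual is not shown above; a minimal version that compares heading paths element by element (an assumption), with an illustrative note:

// Hypothetical helper: shallow equality for heading-path arrays.
const arraysEqual = (a = [], b = []) =>
  a.length === b.length && a.every((value, i) => value === b[i]);

// Illustrative: a 60-token paragraph group under ['Finance', 'Invoices'] is
// appended to the most recent merged group with the same heading path rather
// than being indexed on its own.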

Phase 4: Quality Filtering

Junk chunks are filtered out:

const JUNK_PATTERNS = [
  /^\[Table of Contents\]$/i,
  /^\[Breadcrumb\]$/i,
  /^---+$/,
  /^\[Link to page\]$/i,
  /^\s*$/,
  /^[-_=\s]+$/,
];

export const shouldIndexChunk = (group) => {
  const trimmed = (group.content || '').trim();

  if (trimmed.length < 20) return false;
  if ((group.tokens || 0) < 10) return false;

  for (const pattern of JUNK_PATTERNS) {
    if (pattern.test(trimmed)) return false;
  }

  return true;
};
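
For illustration, applying the filter to a few hand-made groups (values are made up, not real pipeline output):

// Illustrative only.
shouldIndexChunk({ content: '---', tokens: 1 });                  // false: junk pattern, too short
shouldIndexChunk({ content: '[Table of Contents]', tokens: 3 });  // false: junk pattern
shouldIndexChunk({
  content: 'Invoices over $10,000 require director approval before payment.',
  tokens: 14,
});                                                               // true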

Phase 5: Overlap Injection

A trailing slice of each chunk is carried into the next one to bridge context across chunk boundaries:

const OVERLAP_CHARS = 400; // ~100 tokens

for (let i = 1; i < processedGroups.length; i++) {
  const prevContent = processedGroups[i - 1].content;

  if (prevContent.length > 50) {
    let overlap = prevContent.substring(prevContent.length - OVERLAP_CHARS);

    // Prefer sentence boundary
    const sentenceBreak = overlap.indexOf('. ');
    if (sentenceBreak > 0 && sentenceBreak < overlap.length * 0.5) {
      overlap = overlap.substring(sentenceBreak + 2);
    }

    processedGroups[i].overlapBefore = overlap.trim();
  }
}
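
A worked example with made-up text: the previous chunk is shorter than OVERLAP_CHARS, so the whole chunk is the candidate window, and the leading sentence is dropped because its ". " break falls in the first half of the window:

// Illustrative only.
const prev =
  'Submit the form. Approval of invoices over ten thousand dollars usually ' +
  'takes two business days and requires a director signature.';
// overlapBefore === 'Approval of invoices over ten thousand dollars usually
// takes two business days and requires a director signature.'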

Final Chunk Structure

Each chunk includes rich metadata:

{
  pageContent: "[Finance > Invoices > Approval Rules]\n\n...content...",
  metadata: {
    // Page context
    workspaceId: "ws-123",
    sourceId: "page-456",
    documentTitle: "Finance Policy",
    documentUrl: "https://retrieva.online/sources/...",

    // Semantic metadata
    block_type: "list",
    heading_path: ["Finance", "Invoices", "Approval Rules"],
    block_types_in_chunk: ["bulleted_list_item"],

    // Size info
    estimatedTokens: 285,
    blockCount: 12,
    chunkIndex: 3,
    totalChunks: 15,

    // Special flags
    is_code: false,
    is_table: false,
    is_list: true,
    code_language: null,

    // Overlap tracking
    has_overlap: true,
    overlap_chars: 95,
  }
}
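
A minimal sketch of how a processed group could be assembled into this shape with LangChain's Document class (the heading-path prefix follows the example above; where overlapBefore is spliced in and the pageContext argument are assumptions):

import { Document } from '@langchain/core/documents';

// Sketch: turn one processed group into a LangChain Document.
const toDocument = (group, index, total, pageContext) => {
  const headingPrefix = group.headingPath.length
    ? `[${group.headingPath.join(' > ')}]\n\n`
    : '';
  const overlap = group.overlapBefore ? `${group.overlapBefore}\n\n` : '';

  return new Document({
    pageContent: `${headingPrefix}${overlap}${group.content}`,
    metadata: {
      ...pageContext, // workspaceId, sourceId, documentTitle, documentUrl
      block_type: group.category,
      heading_path: group.headingPath,
      estimatedTokens: group.tokens,
      chunkIndex: index,
      totalChunks: total,
      has_overlap: Boolean(group.overlapBefore),
      overlap_chars: group.overlapBefore ? group.overlapBefore.length : 0,
    },
  });
};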

Token Estimation

Token counts are estimated with lightweight, language-aware heuristics for size management:

// utils/rag/tokenEstimation.js

export function estimateTokens(text, options = {}) {
  if (!text) return 0;

  // Language-aware heuristics
  const charsPerToken = detectLanguage(text) === 'cjk' ? 1.5 : 4.5;

  // Adjust for code (more special chars)
  if (options.isCode) {
    return Math.ceil(text.length / 3.0);
  }

  return Math.ceil(text.length / charsPerToken);
}
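
detectLanguage is not shown; a minimal CJK check that would satisfy this call (the real heuristic may be more involved), plus illustrative estimates:

// Hypothetical sketch: treat text as CJK when a meaningful share of its characters
// fall in the Hiragana/Katakana, CJK Unified, or Hangul ranges.
const detectLanguage = (text) => {
  const cjk = (text.match(/[\u3040-\u30ff\u3400-\u9fff\uac00-\ud7af]/g) || []).length;
  return cjk / text.length > 0.3 ? 'cjk' : 'latin';
};

// Illustrative estimates:
// estimateTokens('Invoices over $10,000 require approval.')  // 9 (39 chars / 4.5)
// estimateTokens('const x = 1;', { isCode: true })           // 4 (12 chars / 3.0)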

Configuration

Environment variables for tuning:

# Chunk size targets
MAX_GROUP_TOKENS=400
MIN_GROUP_TOKENS=200
MAX_LIST_ITEMS=15

# Overlap
OVERLAP_CHARS=400

# Safety limits
MAX_CHUNK_TOKENS=800
MAX_CHUNK_CHARS=3600
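
A sketch of how these could be wired into the constants used earlier (the parseInt fallbacks mirror the defaults above; whether the service reads them exactly this way is an assumption):

// Hypothetical wiring: environment overrides with the documented defaults.
const MAX_GROUP_TOKENS = parseInt(process.env.MAX_GROUP_TOKENS || '400', 10);
const MIN_GROUP_TOKENS = parseInt(process.env.MIN_GROUP_TOKENS || '200', 10);
const MAX_LIST_ITEMS   = parseInt(process.env.MAX_LIST_ITEMS || '15', 10);
const OVERLAP_CHARS    = parseInt(process.env.OVERLAP_CHARS || '400', 10);
const MAX_CHUNK_TOKENS = parseInt(process.env.MAX_CHUNK_TOKENS || '800', 10);
const MAX_CHUNK_CHARS  = parseInt(process.env.MAX_CHUNK_CHARS || '3600', 10);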

Metrics

Typical chunking results:

| Metric | Target | Typical |
| --- | --- | --- |
| Avg chunk size | 200-400 tokens | 285 tokens |
| Chunks needing split | <10% | 5% |
| Junk chunks filtered | - | 8% |
| Chunks with heading path | >80% | 92% |