
The Core Problem with RAG: Why Knowledge Base Retrieval Results Are Inaccurate, from Chunk Strategy to Embedding Model Selection

Many teams building RAG initially focus most on the generation model:

  • Which large model to use for answering
  • Which platform to build on
  • How to write prompts

But after formal operations begin, the most common problem is usually not “the model cannot answer” but “the system simply did not retrieve the correct materials.”

In other words, the core problem with RAG is typically not on the generation side but on the retrieval side.

If retrieval results themselves are inaccurate, then the stronger the downstream model, the more likely it is to generate seemingly reasonable but actually misleading answers on top of incorrect context.

This article focuses on this problem: why knowledge base retrieval results are often inaccurate, and how enterprises should understand and address this in practice – from chunk strategy to embedding model selection.

1. Why RAG Often Performs “Inaccurately”

The ideal RAG flow is typically:

User asks a question
→ System retrieves the most relevant content from the knowledge base
→ Model generates an answer based on that content

The problem is concentrated in the second step. In reality, inaccurate retrieval typically manifests in three ways:

  1. Materials clearly exist but were not retrieved
  2. Retrieval hit some content, but not the most critical passage
  3. Retrieval returned many chunks, but too much noise actually reduced answer quality

Many teams attribute these problems to “the large model being too unstable,” but the more common causes in practice include:

  • Unreasonable document chunking approach
  • Insufficient data cleaning
  • Obvious semantic gap between user questions and document wording
  • Embedding model not matching the business corpus
  • Retrieval parameters not tuned for the specific scenario

2. First-Layer Problem: Chunk Strategy Determines the Retrieval Ceiling

One of the fundamental operations in RAG is splitting long documents into multiple chunks, then vectorizing and retrieving these chunks.

The problem is that once a document is split, the semantic relationships that belonged to the complete context may be destroyed.

A Typical Example

A sentence in the original document might be:

  • “Employees are entitled to 10 days of annual leave after completing six months of service.”

If after chunking, only this remains:

  • “after six months, entitled to 10 days”

This fragment still carries some information, but separated from its context, its true meaning becomes ambiguous: the system has no natural way to tell whether it describes annual leave, allowances, probation periods, or training programs.

This is the most common first type of problem in many RAG systems: information still exists, but semantic completeness is lost after chunking.

3. Smaller Chunks Are Not Automatically Better, Nor Are Larger Ones

In practice, teams often develop an intuition: making chunks smaller will make retrieval more precise.

But the reality is not that simple.

Problems When Chunks Are Too Small

  • Context is lost
  • A complete rule gets split into multiple fragments
  • Even when retrieval hits, the model receives incomplete context

Problems When Chunks Are Too Large

  • Noise increases
  • Multiple topics get mixed into a single chunk
  • Although retrieval hits, the truly critical information is buried in a long paragraph

Therefore, the essence of chunk strategy is balancing two things:

  • Semantic completeness
  • Retrieval focus

A More Reasonable Practical Approach

Different document types should typically use different chunking approaches:

  • Policy documents: Chunk by clause or section
  • FAQ documents: Chunk by question-answer pair
  • Contract documents: Chunk by clause
  • Product documents: Chunk by feature or topic

In other words, a more effective chunk strategy is typically not based on fixed character counts but on chunking according to the document’s semantic structure.
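As a rough illustration, the sketch below (in Python; the heading pattern, size limit, and sample text are all assumptions) splits a policy-style document on its section headings rather than on a fixed character count, and only falls back to paragraph-level splitting inside an oversized section:

```python
import re

# Structure-aware chunking sketch: split on clause/section headings instead of
# a fixed character count, and only split further when a section is too long.
# The heading pattern below is an assumption and must match your real documents.
HEADING = re.compile(r"^(Article|Section|Clause)\s+\d+", re.MULTILINE)

def chunk_by_structure(text: str, max_chars: int = 800) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any preamble before the first heading
    bounds = zip(starts, starts[1:] + [len(text)])
    sections = [text[a:b].strip() for a, b in bounds if text[a:b].strip()]

    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Fall back to paragraph-level splitting only inside oversized sections.
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks

doc = (
    "Article 1 Annual Leave\n"
    "Employees are entitled to 10 days of annual leave after completing six months of service.\n\n"
    "Article 2 Expense Claims\n"
    "Travel expenses must be claimed within 30 days of the business trip."
)
for chunk in chunk_by_structure(doc):
    print("---\n" + chunk)
```

The point is not this particular regex but the principle: the split boundaries should follow the document’s own structure, so each chunk stays a self-contained semantic unit.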

4. Second-Layer Problem: Significant Gap Between How Questions Are Asked and How Documents Are Written

Another common reason for inaccurate RAG is the difference in expression style between user questions and knowledge base documents.

For example, users may ask directly:

  • “How many days of leave do I get?”
  • “Can I expense this?”
  • “What should I watch out for during a typhoon?”

The documents, meanwhile, tend to be worded much more formally:

  • “年次有給休暇の付与日数” (Number of annual paid leave days granted)
  • “旅費精算規程 第八条” (Travel expense settlement regulations, Article 8)
  • “災害時の行動基準” (Standards of conduct during disasters)

User questions tend to be more colloquial, vaguer, and more compressed; document wording tends to be more formal, more complete, and more professional. Even when semantically related, their distance in vector space may not be close enough.

This is also why an increasing number of teams introduce enhancement strategies such as Query Rewrite and HyDE:

  • First rewrite user questions into forms closer to policy language or document language
  • Then use the rewritten query for retrieval

At their core, these methods bridge the gap between “how questions are asked” and “how documents are written.”
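A minimal Query Rewrite sketch is shown below; the prompt wording is an assumption to be tuned, and the llm parameter stands for whatever “prompt in, text out” call your stack already provides:

```python
from typing import Callable

# Query Rewrite sketch: turn a colloquial question into wording closer to the
# documents before retrieval. The prompt text is an assumption to be tuned;
# `llm` is any "prompt in, text out" callable from your own stack.
REWRITE_PROMPT = (
    "Rewrite the user's question into the formal wording used in internal "
    "policy documents, keeping the original intent. Return only the rewritten query.\n\n"
    "User question: {question}\n"
    "Rewritten query:"
)

def rewrite_query(question: str, llm: Callable[[str], str]) -> str:
    return llm(REWRITE_PROMPT.format(question=question)).strip()

def retrieve_with_rewrite(question: str,
                          llm: Callable[[str], str],
                          search: Callable[[str], list]) -> list:
    rewritten = rewrite_query(question, llm)
    # Querying with both forms is a common safety net in case the rewrite misfires.
    return search(rewritten) + search(question)
```

HyDE follows the same idea in reverse: instead of rewriting the question, the model drafts a short hypothetical answer passage, and that passage is embedded and used for retrieval.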

5. Third-Layer Problem: Data Quality Is Usually More Important Than Model Selection

Many RAG systems perform unstably not because the strategy is not sophisticated enough but because the underlying data itself is not clean enough.

Common issues include:

  • Multiple versions of the same content uploaded
  • Old and new versions of policies mixed together
  • PDF text extraction errors
  • Unstable OCR scan quality
  • A single document containing multiple unrelated topics
  • Headers, footers, page numbers, and stamps entering the body text

In these situations, even the best embedding model can only vectorize low-quality data.

Therefore, a very important principle is:

RAG retrieval quality depends first on whether the input data is clean enough.

Pre-upload cleaning, deduplication, version control, and structural organization are often more worthwhile to invest in first than complex post-hoc tuning.
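As a small illustration of this principle, the sketch below (standard library only; the footer pattern and sample documents are assumptions) normalizes text, strips obvious header and footer lines, and drops exact duplicates by content hash before anything reaches the embedding step:

```python
import hashlib
import re

# Pre-ingestion cleanup sketch: normalize whitespace, drop obvious header/footer
# lines, and remove exact duplicates by content hash. The footer pattern and the
# sample documents are assumptions for illustration only.
FOOTER = re.compile(r"^(Page \d+( of \d+)?|Confidential)$", re.IGNORECASE)

def clean(text: str) -> str:
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line and not FOOTER.match(line)]
    return re.sub(r"\s+", " ", " ".join(lines))

def dedupe(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(clean(doc).encode("utf-8")).hexdigest()
        if digest not in seen:  # keep only the first copy of identical content
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "Leave Policy v2\nEmployees get 10 days of annual leave.\nPage 1 of 3",
    "Leave Policy v2\nEmployees get 10 days of annual leave.\nPage 1 of 3",
]
print(len(dedupe(docs)))  # -> 1
```

This only catches exact duplicates; near-duplicate versions of the same policy still need explicit version control and human review.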

6. Why Embedding Model Selection Is Critical

On many platforms, the embedding model tends to be treated as a default component – as if having one were sufficient.

But in reality, the embedding model directly determines how the system understands “similarity.”

This is because the essence of vector retrieval is not keyword matching but how the embedding model maps text into semantic space.

What This Means

Depending on the embedding model, the same sentence can end up with quite different nearest neighbors in semantic space. For example:

  • Some models are better at English technical text
  • Some models are better suited for Japanese or Chinese business text
  • Some models are better at short-sentence queries
  • Some models are more stable at compressing long-text context

If an enterprise knowledge base mainly consists of policies, contracts, and internal regulations – business corpus – but the selected embedding model is not well suited for this type of text structure, then inaccurate retrieval is a very natural outcome.
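As a concrete illustration, the sketch below shows how that “similarity” is actually computed. It assumes the sentence-transformers package; the multilingual model name is just one public example, not a recommendation, and swapping it changes the scores:

```python
from sentence_transformers import SentenceTransformer, util

# The "closeness" retrieval relies on is just cosine similarity between vectors
# produced by the embedding model; swap the model and these scores change.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example choice

query = "How many days of leave do I get?"
chunks = [
    "Number of annual paid leave days granted: 10 days after six months of service.",
    "Travel expense settlement regulations, Article 8: claims must be filed within 30 days.",
]

query_vec = model.encode(query, convert_to_tensor=True)
chunk_vecs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]

for chunk, score in zip(chunks, scores):
    print(f"{float(score):.3f}  {chunk}")
```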

7. How Embedding Models Should Be Evaluated

In enterprise practice, embedding model selection should typically consider at least four dimensions.

1. Language Match

What language is the knowledge base primarily written in, and what language do users primarily query in?

If there is a difference between the knowledge base language and user query language, particular attention should be paid to the model’s performance in cross-language semantic alignment.

2. Text Type Match

Whether it is FAQ, policy, contract, product documentation, or news content – different corpus types typically correspond to different performance characteristics.

3. Length and Granularity Performance

Some models are better at short-sentence matching, while others are more stable at semantic compression of long paragraphs.

4. Cost and Speed

When going live formally, enterprises also need to consider:

  • Index building cost
  • Index rebuild cost
  • Retrieval response speed
  • Whether local or private deployment is supported

Therefore, embedding model selection should not focus only on “which is most advanced” but primarily on “which is most suitable for the current business corpus and actual deployment conditions.”
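One lightweight way to ground that choice is a small comparative test: take a hand-labeled set of real user questions with the chunk each one should retrieve, and measure recall at k for every candidate model behind the same interface. The sketch below assumes a generic embed function per model; the wrapper for each real model depends on your stack:

```python
import numpy as np
from typing import Callable

# Comparative embedding test sketch: for each candidate model (wrapped behind the
# same embed() signature), measure how often the expected chunk lands in the top-k
# results for a hand-labeled set of (query, expected chunk index) pairs.
# Building that eval set from real user questions is the important part.

Embed = Callable[[list[str]], np.ndarray]  # texts -> (n, dim) array

def recall_at_k(embed: Embed, queries: list[str], chunks: list[str],
                expected: list[int], k: int = 3) -> float:
    chunk_vecs = embed(chunks)
    query_vecs = embed(queries)
    # Normalize so that a dot product equals cosine similarity.
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    query_vecs = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)

    hits = 0
    for query_vec, target in zip(query_vecs, expected):
        top = np.argsort(-(chunk_vecs @ query_vec))[:k]
        hits += int(target in top)
    return hits / len(queries)

# Usage sketch (embed_a / embed_b are your own wrappers around candidate models):
# for name, embed in {"model_a": embed_a, "model_b": embed_b}.items():
#     print(name, recall_at_k(embed, queries, chunks, expected))
```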

8. Why Simply Switching the Large Model Usually Cannot Solve RAG Problems

When teams find RAG performing poorly, the common first reaction is to switch to a more powerful generation model.

But if the context retrieved is inaccurate, even the strongest generation model cannot fundamentally solve the problem. What it can typically do is limited to:

  1. Generate incorrect answers based on incorrect context
  2. Generate off-focus answers amid excessively noisy context
  3. Fill in with common knowledge when materials are insufficient

This is also why in actual development, the recommended optimization order is:

  1. Data cleaning
  2. Chunk design
  3. Query rewriting or retrieval enhancement
  4. Embedding model evaluation
  5. Finally, optimize the generation model and prompts

9. A More Realistic Optimization Sequence

If an enterprise is building an internal RAG system, it is recommended to troubleshoot problems in the following priority order.

Step 1: Check Data Quality

  • Are there duplicate documents
  • Are outdated materials mixed in
  • Can PDFs be reliably text-extracted
  • Is content reasonably split by topic

Step 2: Check Chunk Strategy

  • Are chunks split by semantic structure rather than pure character count
  • Are there instances of context being cut off
  • Are there chunks covering multiple topics

Step 3: Check Query Approach

  • Are user questions overly colloquial or vague
  • Is Query Rewrite or HyDE needed
  • Is coreference resolution needed in multi-turn conversations

Step 4: Check Embedding Model

  • Is it suitable for the current language
  • Is it suitable for the current document type
  • Has actual comparative testing been done, rather than simply using the default option

Step 5: Check Retrieval Parameters

  • Is Top K set reasonably
  • Are duplicate chunks crowding out results
  • Is re-ranking or an additional retrieval strategy needed (see the sketch below)
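For the re-ranking point above, a minimal sketch looks like the following; it assumes the sentence-transformers CrossEncoder class, and the model name is one public example that should be swapped for one matching your language:

```python
from sentence_transformers import CrossEncoder

# Re-ranking sketch: over-fetch with vector search (e.g. the top 20 chunks), then
# let a cross-encoder re-score each (query, chunk) pair and keep only the best few.
# The model name is one public example; pick one that matches your language.
def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = scorer.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```

Loading the scorer inside the function is only for brevity; in a real service it would be loaded once and reused across requests.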

10. The Essence of RAG Is Not “Stuffing Materials In”

Many people understand RAG as:

“As long as company materials are uploaded, AI will understand.”

But from a systems perspective, RAG is more like a retrieval system, and the generation model is simply the last layer of expression capability in that system.

Its key is not “how much it remembers” but:

  • At the right moment
  • From the right location
  • Finding the right context
  • Then handing it to the model for organization and output

Therefore, the core of RAG has never been “whether there is a knowledge base” but “whether the retrieval system truly understands your document structure and user questions.”

Conclusion

Inaccurate knowledge base retrieval results are usually not a single problem but the cumulative result of issues across multiple layers:

  • Unreasonable chunk splitting approach
  • Insufficient data quality
  • Excessive gap between user questions and document expressions
  • Embedding model not matching the corpus
  • Retrieval parameters not tuned for the business scenario

So, the core problem with RAG is not “how to make the model answer better” but:

How to make the system first find the content that truly should be answered.

Once this sequence is straightened out, many problems that initially appear to be “the model is not smart enough” will be traced back to more specific – and more solvable – retrieval engineering problems.