
The Core Problem with RAG: Why Knowledge Base Retrieval Results Are Inaccurate, from Chunk Strategy to Embedding Model Selection

Many teams building RAG initially focus most on the generation model:

  • Which large model to use for answering
  • Which platform to build on
  • How to write prompts

But after formal operations begin, the most common problem is usually not “the model cannot answer” but “the system simply did not retrieve the correct materials.”

In other words, the core problem with RAG is typically not on the generation side but on the retrieval side.

If retrieval results themselves are inaccurate, then the stronger the downstream model, the more likely it is to generate seemingly reasonable but actually misleading answers on top of incorrect context.

This article focuses on this problem: why knowledge base retrieval results are often inaccurate, and how enterprises should understand and address this in practice – from chunk strategy to embedding model selection.

1. Why RAG Often Performs “Inaccurately”

The ideal RAG flow is typically:

User asks a question
→ System retrieves the most relevant content from the knowledge base
→ Model generates an answer based on that content

The problem is concentrated in the second step. In reality, inaccurate retrieval typically manifests in three ways:

  1. Materials clearly exist but were not retrieved
  2. Retrieval hit some content, but not the most critical passage
  3. Retrieval returned many chunks, but too much noise actually reduced answer quality

Many teams attribute these problems to “the large model being too unstable,” but the more common causes in practice include:

  • Unreasonable document chunking approach
  • Insufficient data cleaning
  • Obvious semantic gap between user questions and document wording
  • Embedding model not matching the business corpus
  • Retrieval parameters not tuned for the specific scenario

2. First-Layer Problem: Chunk Strategy Determines the Retrieval Ceiling

One of the fundamental operations in RAG is splitting long documents into multiple chunks, then vectorizing and retrieving these chunks.

The problem is that once a document is split, the semantic relationships that belonged to the complete context may be destroyed.

A Typical Example

A sentence in the original document might be:

  • “Employees are entitled to 10 days of annual leave after completing six months of service.”

If after chunking, only this remains:

  • “after six months, entitled to 10 days”

This fragment still carries some information, but separated from its context, its true meaning becomes ambiguous: the system has no natural way to tell whether it describes annual leave, allowances, probation periods, or training programs.

This is the most common first type of problem in many RAG systems: information still exists, but semantic completeness is lost after chunking.

3. Smaller Chunks Are Not Automatically Better, Nor Are Larger Ones

In practice, teams often develop an intuition: making chunks smaller will make retrieval more precise.

But the reality is not that simple.

Problems When Chunks Are Too Small

  • Context is lost
  • A complete rule gets split into multiple fragments
  • Even when retrieval hits, the model receives incomplete context

Problems When Chunks Are Too Large

  • Noise increases
  • Multiple topics get mixed into a single chunk
  • Although retrieval hits, the truly critical information is buried in a long paragraph

Therefore, the essence of chunk strategy is balancing two things:

  • Semantic completeness
  • Retrieval focus

A More Reasonable Practical Approach

Different document types should typically use different chunking approaches:

  • Policy documents: Chunk by clause or section
  • FAQ documents: Chunk by question-answer pair
  • Contract documents: Chunk by clause
  • Product documents: Chunk by feature or topic

In other words, a more effective chunk strategy is typically not based on fixed character counts but on chunking according to the document’s semantic structure.
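As a rough illustration, the sketch below (in Python; the heading pattern, size limit, and sample text are all assumptions) splits a policy-style document on its section headings rather than on a fixed character count, and only falls back to paragraph-level splitting inside an oversized section:

```python
import re

# Structure-aware chunking sketch: split on clause/section headings instead of
# a fixed character count, and only split further when a section is too long.
# The heading pattern below is an assumption and must match your real documents.
HEADING = re.compile(r"^(Article|Section|Clause)\s+\d+", re.MULTILINE)

def chunk_by_structure(text: str, max_chars: int = 800) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any preamble before the first heading
    bounds = zip(starts, starts[1:] + [len(text)])
    sections = [text[a:b].strip() for a, b in bounds if text[a:b].strip()]

    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Fall back to paragraph-level splitting only inside oversized sections.
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks

doc = (
    "Article 1 Annual Leave\n"
    "Employees are entitled to 10 days of annual leave after completing six months of service.\n\n"
    "Article 2 Expense Claims\n"
    "Travel expenses must be claimed within 30 days of the business trip."
)
for chunk in chunk_by_structure(doc):
    print("---\n" + chunk)
```

The point is not this particular regex but the principle: the split boundaries should follow the document’s own structure, so each chunk stays a self-contained semantic unit.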

4. Second-Layer Problem: Significant Gap Between How Questions Are Asked and How Documents Are Written

Another common reason for inaccurate RAG is the difference in expression style between user questions and knowledge base documents.

For example, users may ask directly:

  • “How many days of leave do I get?”
  • “Can I expense this?”
  • “What should I watch out for during a typhoon?”

The documents, meanwhile, tend to be worded much more formally:

  • “年次有給休暇の付与日数” (Number of annual paid leave days granted)
  • “旅費精算規程 第八条” (Travel expense settlement regulations, Article 8)
  • “災害時の行動基準” (Standards of conduct during disasters)

User questions tend to be more colloquial, vaguer, and more compressed; document wording tends to be more formal, more complete, and more professional. Even when semantically related, their distance in vector space may not be close enough.

This is also why an increasing number of teams introduce enhancement strategies such as Query Rewrite and HyDE:

  • First rewrite user questions into forms closer to policy language or document language
  • Then use the rewritten query for retrieval

At their core, these methods bridge the gap between “how questions are asked” and “how documents are written.”
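A minimal Query Rewrite sketch is shown below; the prompt wording is an assumption to be tuned, and the llm parameter stands for whatever “prompt in, text out” call your stack already provides:

```python
from typing import Callable

# Query Rewrite sketch: turn a colloquial question into wording closer to the
# documents before retrieval. The prompt text is an assumption to be tuned;
# `llm` is any "prompt in, text out" callable from your own stack.
REWRITE_PROMPT = (
    "Rewrite the user's question into the formal wording used in internal "
    "policy documents, keeping the original intent. Return only the rewritten query.\n\n"
    "User question: {question}\n"
    "Rewritten query:"
)

def rewrite_query(question: str, llm: Callable[[str], str]) -> str:
    return llm(REWRITE_PROMPT.format(question=question)).strip()

def retrieve_with_rewrite(question: str,
                          llm: Callable[[str], str],
                          search: Callable[[str], list]) -> list:
    rewritten = rewrite_query(question, llm)
    # Querying with both forms is a common safety net in case the rewrite misfires.
    return search(rewritten) + search(question)
```

HyDE follows the same idea in reverse: instead of rewriting the question, the model drafts a short hypothetical answer passage, and that passage is embedded and used for retrieval.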

5. Third-Layer Problem: Data Quality Is Usually More Important Than Model Selection

Many RAG systems perform unstably not because the strategy is not sophisticated enough but because the underlying data itself is not clean enough.

Common issues include:

  • Multiple versions of the same content uploaded
  • Old and new versions of policies mixed together
  • PDF text extraction errors
  • Unstable OCR scan quality
  • A single document containing multiple unrelated topics
  • Headers, footers, page numbers, and stamps entering the body text

In these situations, even the best embedding model can only vectorize low-quality data.

Therefore, a very important principle is:

RAG retrieval quality depends first on whether the input data is clean enough.

Pre-upload cleaning, deduplication, version control, and structural organization are often more worthwhile to invest in first than complex post-hoc tuning.
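As a small illustration of this principle, the sketch below (standard library only; the footer pattern and sample documents are assumptions) normalizes text, strips obvious header and footer lines, and drops exact duplicates by content hash before anything reaches the embedding step:

```python
import hashlib
import re

# Pre-ingestion cleanup sketch: normalize whitespace, drop obvious header/footer
# lines, and remove exact duplicates by content hash. The footer pattern and the
# sample documents are assumptions for illustration only.
FOOTER = re.compile(r"^(Page \d+( of \d+)?|Confidential)$", re.IGNORECASE)

def clean(text: str) -> str:
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line and not FOOTER.match(line)]
    return re.sub(r"\s+", " ", " ".join(lines))

def dedupe(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(clean(doc).encode("utf-8")).hexdigest()
        if digest not in seen:  # keep only the first copy of identical content
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "Leave Policy v2\nEmployees get 10 days of annual leave.\nPage 1 of 3",
    "Leave Policy v2\nEmployees get 10 days of annual leave.\nPage 1 of 3",
]
print(len(dedupe(docs)))  # -> 1
```

This only catches exact duplicates; near-duplicate versions of the same policy still need explicit version control and human review.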

6. Why Embedding Model Selection Is Critical

On many platforms, the embedding model tends to be treated as a default component – as if having one were sufficient.

But in reality, the embedding model directly determines how the system understands “similarity.”

This is because the essence of vector retrieval is not keyword matching but how the embedding model maps text into semantic space.

What This Means

Depending on the embedding model, the same sentence can end up with quite different nearest neighbors in semantic space. For example:

  • Some models are better at English technical text
  • Some models are better suited for Japanese or Chinese business text
  • Some models are better at short-sentence queries
  • Some models are more stable at compressing long-text context

If an enterprise knowledge base mainly consists of policies, contracts, and internal regulations – business corpus – but the selected embedding model is not well suited for this type of text structure, then inaccurate retrieval is a very natural outcome.
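As a concrete illustration, the sketch below shows how that “similarity” is actually computed. It assumes the sentence-transformers package; the multilingual model name is just one public example, not a recommendation, and swapping it changes the scores:

```python
from sentence_transformers import SentenceTransformer, util

# The "closeness" retrieval relies on is just cosine similarity between vectors
# produced by the embedding model; swap the model and these scores change.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example choice

query = "How many days of leave do I get?"
chunks = [
    "Number of annual paid leave days granted: 10 days after six months of service.",
    "Travel expense settlement regulations, Article 8: claims must be filed within 30 days.",
]

query_vec = model.encode(query, convert_to_tensor=True)
chunk_vecs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]

for chunk, score in zip(chunks, scores):
    print(f"{float(score):.3f}  {chunk}")
```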

7. How Embedding Models Should Be Evaluated

In enterprise practice, embedding model selection should typically consider at least four dimensions.

1. Language Match

What language is the knowledge base primarily written in, and what language do users primarily query in?

If there is a difference between the knowledge base language and user query language, particular attention should be paid to the model’s performance in cross-language semantic alignment.

2. Text Type Match

Whether it is FAQ, policy, contract, product documentation, or news content – different corpus types typically correspond to different performance characteristics.

3. Length and Granularity Performance

Some models are better at short-sentence matching, while others are more stable at semantic compression of long paragraphs.

4. Cost and Speed

When going live formally, enterprises also need to consider:

  • Index building cost
  • Index rebuild cost
  • Retrieval response speed
  • Whether local or private deployment is supported

Therefore, embedding model selection should not focus only on “which is most advanced” but primarily on “which is most suitable for the current business corpus and actual deployment conditions.”
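One lightweight way to ground that choice is a small comparative test: take a hand-labeled set of real user questions with the chunk each one should retrieve, and measure recall at k for every candidate model behind the same interface. The sketch below assumes a generic embed function per model; the wrapper for each real model depends on your stack:

```python
import numpy as np
from typing import Callable

# Comparative embedding test sketch: for each candidate model (wrapped behind the
# same embed() signature), measure how often the expected chunk lands in the top-k
# results for a hand-labeled set of (query, expected chunk index) pairs.
# Building that eval set from real user questions is the important part.

Embed = Callable[[list[str]], np.ndarray]  # texts -> (n, dim) array

def recall_at_k(embed: Embed, queries: list[str], chunks: list[str],
                expected: list[int], k: int = 3) -> float:
    chunk_vecs = embed(chunks)
    query_vecs = embed(queries)
    # Normalize so that a dot product equals cosine similarity.
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    query_vecs = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)

    hits = 0
    for query_vec, target in zip(query_vecs, expected):
        top = np.argsort(-(chunk_vecs @ query_vec))[:k]
        hits += int(target in top)
    return hits / len(queries)

# Usage sketch (embed_a / embed_b are your own wrappers around candidate models):
# for name, embed in {"model_a": embed_a, "model_b": embed_b}.items():
#     print(name, recall_at_k(embed, queries, chunks, expected))
```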

8. Why Simply Switching the Large Model Usually Cannot Solve RAG Problems

When teams find RAG performing poorly, the common first reaction is to switch to a more powerful generation model.

But if the context retrieved is inaccurate, even the strongest generation model cannot fundamentally solve the problem. What it can typically do is limited to:

  1. Generate incorrect answers based on incorrect context
  2. Generate off-focus answers amid excessively noisy context
  3. Fill in with common knowledge when materials are insufficient

This is also why in actual development, the recommended optimization order is:

  1. Data cleaning
  2. Chunk design
  3. Query rewriting or retrieval enhancement
  4. Embedding model evaluation
  5. Finally, optimize the generation model and prompts

9. A More Realistic Optimization Sequence

If an enterprise is building an internal RAG system, it is recommended to troubleshoot problems in the following priority order.

Step 1: Check Data Quality

  • Are there duplicate documents
  • Are outdated materials mixed in
  • Can PDFs be reliably text-extracted
  • Is content reasonably split by topic

Step 2: Check Chunk Strategy

  • Are chunks split by semantic structure rather than pure character count
  • Are there instances of context being cut off
  • Are there chunks covering multiple topics

Step 3: Check Query Approach

  • Are user questions overly colloquial or vague
  • Is Query Rewrite or HyDE needed
  • Is coreference resolution needed in multi-turn conversations

Step 4: Check Embedding Model

  • Is it suitable for the current language
  • Is it suitable for the current document type
  • Has actual comparative testing been done, rather than simply using the default option

Step 5: Check Retrieval Parameters

  • Is Top K set reasonably
  • Are duplicate chunks crowding out results
  • Is re-ranking or an additional retrieval strategy needed (see the sketch below)
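For the re-ranking point above, a minimal sketch looks like the following; it assumes the sentence-transformers CrossEncoder class, and the model name is one public example that should be swapped for one matching your language:

```python
from sentence_transformers import CrossEncoder

# Re-ranking sketch: over-fetch with vector search (e.g. the top 20 chunks), then
# let a cross-encoder re-score each (query, chunk) pair and keep only the best few.
# The model name is one public example; pick one that matches your language.
def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = scorer.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```

Loading the scorer inside the function is only for brevity; in a real service it would be loaded once and reused across requests.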

10. The Essence of RAG Is Not “Stuffing Materials In”

Many people understand RAG as:

“As long as company materials are uploaded, AI will understand.”

But from a systems perspective, RAG is more like a retrieval system, and the generation model is simply the last layer of expression capability in that system.

Its key is not “how much it remembers” but:

  • At the right moment
  • From the right location
  • Finding the right context
  • Then handing it to the model for organization and output

Therefore, the core of RAG has never been “whether there is a knowledge base” but “whether the retrieval system truly understands your document structure and user questions.”

Conclusion

Inaccurate knowledge base retrieval results are usually not a single problem but the cumulative result of issues across multiple layers:

  • Unreasonable chunk splitting approach
  • Insufficient data quality
  • Excessive gap between user questions and document expressions
  • Embedding model not matching the corpus
  • Retrieval parameters not tuned for the business scenario

So, the core problem with RAG is not “how to make the model answer better” but:

How to make the system first find the content that truly should be answered.

Once this sequence is straightened out, many problems that initially appear to be “the model is not smart enough” will be traced back to more specific – and more solvable – retrieval engineering problems.