Today, organizations are making significant investments in AI to unlock the value of their internal knowledge and provide employees with access to it. Yet, despite what on the surface seems to be a solved problem, a fundamental challenge remains: how to structure and provide access to corporate knowledge in a way that is both accurate and scalable.
Below is the first in a series of articles by Integrio Systems, based not on theory but on hands-on experience in building conversational access to both structured (databases, APIs) and unstructured (documents, PDFs, etc.) enterprise data. We explored multiple architectural approaches to building an effective AI-powered corporate knowledge base — including Retrieval-Augmented Generation (RAG), direct database interaction via Model Context Protocol (MCP) over SQL, and API-driven MCP integrations. Each approach offers distinct advantages and trade-offs.
Our goal was simple: to determine what actually works in a real enterprise or product environment and is both accurate and scalable. Not just what is technically possible, but what delivers reliable answers, maintains data integrity, scales with organizational complexity, and remains maintainable over time.
Let’s talk about these two metrics.
Accurate means that the information provided to the user is fully true and not a result of LLM hallucination. A high level of accuracy is obviously important; otherwise, users can’t trust the information provided, and it takes more time to verify it than to bypass AI altogether. In most enterprise environments (corporate, government, engineering, etc.), any level of AI hallucination is not acceptable, as people put their jobs on the line when data is not accurate.
Scalable is not as obvious. What it means in this case is whether the model provides accurate answers as the number of datasets grows and the size of the context (which is one of the main current limitations of LLMs) increases. For example, it is easy to build an MCP model that provides accurate responses using five API calls, but it is a different situation when you need to support and describe a much larger number of APIs. When choosing how to design an AI implementation, it is critical to think ahead about how it will perform as the number of data sources grows.
We break down our findings, compare these approaches in practical scenarios, and provide guidance on when each method makes the most sense. Whether you are building an internal AI assistant, modernizing your knowledge management systems, or evaluating architectural options, we hope you will find this analysis helpful.
Kingdom of LLMs in Data Reasoning
RAG 1.0
Large Language Models (LLMs) are built on neural network architectures and, under the hood, rely on vast numbers of matrix computations and rather sophisticated mathematical machinery. For us, engineers, business analysts, or everyday users, the "magic" of LLMs lies in their ability to understand (in a very specific mathematical sense) the context provided as input to the model and return the most probable sequence of tokens, which we interpret as an answer. Early generations of transformer-based models (the architecture on which LLMs are built), such as BERT and GPT-2, did not produce particularly high-quality responses. Their practical applications were therefore relatively limited due to frequent hallucinations and were mostly focused on working with embeddings, i.e., vector representations of words and text. A typical pipeline with these first-generation models relied on external search platforms like Elasticsearch, so that only the relevant documents were used in LLM prompting.
The next approach was based on embeddings: business data in the form of documents were converted to vector representations by an LLM used as an embedding model and stored in vector stores (ChromaDB, FAISS, Qdrant, Pinecone, Milvus, PGVector + PostgreSQL), special databases designed for fast vector processing (comparison, retrieval, etc.). To retrieve data contextually similar to new customer data, the latter undergo vectorization, and a vector search is performed to retrieve the relevant documents (fig. 1). Although comparatively old-fashioned, this approach is still alive and widely used as a supporting step in data reasoning.
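As an illustration, the vector-search step can be sketched in plain Python. This is a toy example: the hand-made three-dimensional vectors stand in for real model-generated embeddings, the in-memory dict stands in for a vector store, and all names and values are invented.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, store, top_k=2):
    """Return the top_k document ids most similar to the query vector."""
    ranked = sorted(store.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

# Toy 3-dimensional "embeddings"; a real store holds model-generated vectors.
store = {
    "vacation_policy": [0.9, 0.1, 0.0],
    "expense_rules":   [0.1, 0.8, 0.2],
    "onboarding":      [0.0, 0.2, 0.9],
}
print(retrieve([0.85, 0.15, 0.05], store, top_k=1))
```

A production system delegates both the similarity computation and the nearest-neighbor search to the vector database, which uses approximate indexes to stay fast at scale.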
However, with the breakthrough of ChatGPT based on GPT-3.5, something unexpected happened. LLMs began to evolve beyond being merely intermediate tools for natural language processing or vector-based search engines. Instead, they emerged as powerful standalone components capable of independently analyzing data and generating coherent, meaningful responses. With the emergence of more powerful models capable of better understanding and analyzing context, the typical embedding-based workflow began to evolve.
An additional model was introduced into the pipeline (fig. 1), one that interprets the set of relevant documents retrieved from a vector database together with the user's query and generates a final response. That is, the model receives a context consisting of the user query and the relevant data extracted from vector storage via semantic similarity search, and answers the query by combining and reasoning over that context. This approach makes it possible to combine the generative capabilities of LLMs with reliable, domain-specific knowledge stored in external data sources.
As a result, systems could ground their answers in actual documents rather than relying solely on the information encoded in the model’s parameters. This significantly improved factual accuracy, enabled access to up-to-date or proprietary data, and allowed organizations to build AI-powered assistants on top of their own knowledge bases. This approach became known as Retrieval-Augmented Generation (RAG 1.0) and quickly turned into one of the standard workflows for integrating LLMs into business software (fig. 2).
Unfortunately, practices based on the RAG approach are still far from perfect and often unsuitable for real business tasks.
Semantic similarity search relies on mathematical similarity (often cosine similarity or dot product) and can miss useful information when the query context is vague, which results in wrong returned data.
Chunking. Embedding models usually have a limited context for generating embeddings, so the data to be vectorized undergo chunking, i.e., splitting into parts. Such chunks often overlap, which produces multiple vectors with quasi-relevant information. The semantic search can then return wrong or irrelevant vectors, which can be catastrophic for quality-critical enterprise software.
Missing relevant information in the data returned by semantic search can force the model to guess the omitted information, which often leads to wrong reasoning.
Context window length. This becomes a significant limitation for simpler and lower-cost models such as Llama 3.2 or Mixtral. These models often struggle to process very large contexts composed of lengthy documents retrieved through semantic search. When the retrieved material exceeds the effective context capacity of the model, performance and response quality can degrade. A similar challenge appears in other architectures as well, for example, in systems that rely on a large number of MCP tools, which we will discuss later.
To overcome the challenges of this naive RAG approach, a number of improvement practices have emerged, collectively known as RAG 2.0.
RAG 2.0
RAG 2.0 significantly changes the simple and straightforward approach of the first generation. It is no longer a basic data pipeline but a set of techniques and approaches focused on real business scenarios. RAG 2.0 involves hybrid search and re-ranking, optimization of user queries, a multi-step reasoning stage (Self-RAG), and error correction (Corrective RAG). Since there are many different implementations of RAG 2.0 applied to a variety of business cases, let's consider the pipeline currently used in our applications (fig. 3).
User or customer questions are often imprecise, too general, or contain errors. Therefore, to increase the correctness of the model response and the semantic search, reduce sensitivity to wording, cover synonyms and spelling variants, reveal hidden intent, and facilitate the selection of relevant indexes and fields, the Query Rewriting approach is used. For this, an LLM with an appropriate prompt receives the raw user query and transforms it into one more relevant to the data contained in the system.
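A minimal sketch of the rewriting step, under the assumption of a plain prompt-template setup: the prompt text and helper names below are illustrative, and the actual LLM call is omitted.

```python
# Illustrative prompt for a rewriting LLM; wording is an assumption, not a
# fixed standard.
REWRITE_PROMPT = (
    "Rewrite the user question so it is precise, free of typos, and uses the "
    "terminology of our internal knowledge base. Make the hidden intent "
    "explicit and expand abbreviations. Return only the rewritten question.\n"
    "Question: {question}"
)

def normalize(raw: str) -> str:
    """Cheap deterministic cleanup before the LLM rewrite: trim and collapse whitespace."""
    return " ".join(raw.split())

def build_rewrite_prompt(raw_query: str) -> str:
    """Build the prompt sent to the rewriting model (the model call itself is omitted)."""
    return REWRITE_PROMPT.format(question=normalize(raw_query))

print(build_rewrite_prompt("   what is   vacation policy??  "))
```

The rewritten query returned by the model then replaces the raw one in all downstream retrieval steps.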
The next step is Query Decomposition. The LLM generates a set of subqueries, each of which requires a separate retrieval stage. This breaks a complex user query into simpler information needs and yields more relevant documents for each aspect of the problem. In addition, the LLM can determine the sequence of retrieval steps, forming a plan for obtaining information in which the results of previous steps serve as context for subsequent queries. To do this, the model generates a set of refined search queries or a structured execution plan (retrieval plan), which is used to sequentially access the vector database and aggregate the found documents before passing them to the response generation stage (fig. 4). For example, consider the query
"How did employee X's behavior change after the transfer to department Y?"
Such a query is composite: to answer it, the system must obtain information about the transfer event, as well as the employee's behavior before and after this event. Therefore, in the RAG pipeline, the query is first decomposed into several subqueries, each usable separately to retrieve documents, such as:
"When was employee X transferred to department Y?",
"What was the behavior or performance profile of employee X before the transfer?",
"What behavior changes or performance indicators were recorded after the transfer?"
Hence, the system gets three independent retrieval queries, each pointing to specific knowledge. For each subquery, a semantic vector search is performed, and the LLM then aggregates the resulting data. In this example, the answer can be completed at this step, but in more specific cases Query Decomposition can be used to select relevant data processing tools from vector storage, to be invoked in later steps. Query Decomposition can also be iterative, i.e., the result of the first subquery may be needed to form the second one, and so on (multi-hop retrieval).
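The decomposition step can be sketched as a prompt plus a parser for the model's reply. The prompt text and the canned reply below are illustrative; a real system would obtain the reply from an LLM call.

```python
# Illustrative decomposition prompt; wording is an assumption.
DECOMPOSE_PROMPT = (
    "Split the user question into independent retrieval subqueries.\n"
    'Return one subquery per line, numbered "1.", "2.", ...\n'
    "Question: {question}"
)

def parse_subqueries(model_reply: str) -> list[str]:
    """Extract numbered subqueries from the model's plain-text reply."""
    subqueries = []
    for line in model_reply.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            # Drop the leading "N." marker.
            subqueries.append(line.split(".", 1)[1].strip())
    return subqueries

# A reply an LLM might produce for the employee-transfer example above.
reply = """1. When was employee X transferred to department Y?
2. What was the behavior of employee X before the transfer?
3. What behavior changes were recorded after the transfer?"""
print(parse_subqueries(reply))
```

Each parsed subquery then drives its own retrieval pass, and the results are aggregated before generation.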
Hybrid Retrieval is an approach to information retrieval that combines dense retrieval (standard vector search) and sparse retrieval (BM25/keyword search). Its main goal is to combine the advantages of both methods to increase recall, stability of results, and accuracy during the retrieval stage in RAG systems. Sparse retrieval is a straightforward (not LLM-based) method to search for exact token matches.
These two retrieval methods do not use LLMs at all. Instead, Hybrid Retrieval relies on fusion ranking, a numerical technique that scores the results obtained from the two retrievals run in parallel. The scoring can be a) Reciprocal Rank Fusion: score(d) = Σ 1 / (k + rank_i(d)), where rank_i(d) is the position of document d in the i-th ranked list (documents sorted by relevance to the user query) and k is a constant (typically 60); or b) Weighted Score Fusion, where each retrieval method's score is multiplied by a positive weight.
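Reciprocal Rank Fusion is simple enough to show in a few lines of Python. The document ids and list contents below are made up for illustration:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank(d)),
    where rank is the 1-based position of document d in each list."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector-search order
sparse = ["doc_b", "doc_d", "doc_a"]   # BM25 order
print(reciprocal_rank_fusion([dense, sparse]))  # doc_b wins: ranked high in both lists
```

Note that a document appearing in only one list still receives a score, so the fusion never discards results, it only reorders them.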
Tool Calls, or Structured Fetch, represent a more advanced component of the pipeline and, in practice, can be omitted when describing a general RAG 2.0 workflow, since they are more closely related to agentic architectures and Model Context Protocol (MCP)-based systems. In simple terms, tools are specialized functions that perform specific data operations, such as executing SQL queries, calling external APIs, or performing structured data retrieval. The structured results of these operations are then incorporated into the model's context, complementing the retrieved textual context and enriching the information available for reasoning. To implement Tool Calls in a RAG architecture, we must define three key components:
Tool definitions using schemas (functions).
LLM tool invocation via function calling.
Returning structured data (usually JSON).
Each tool is defined as a function with a clearly specified JSON schema describing its parameters. The LLM receives the list of available tools and can invoke them when necessary to answer a query.
For instance, a tool designed to execute SQL queries on a database can be defined as follows:
{
  "name": "sql_query",
  "description": "Execute a SQL query on the company database",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "SQL query to execute"
      }
    },
    "required": ["query"]
  }
}
Basically, the LLM sees the tool's description and, based on the input query, decides whether it needs to be invoked. If the decision is positive, the LLM asks the system (in practice, the agent) to call the tool and return its data, which the LLM then uses to answer the query.
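This request-dispatch loop can be sketched in Python. Everything here is a stand-in: the sql_query tool is stubbed with an in-memory lookup instead of a real database connection, and the model's function-call payload is hand-written rather than produced by an actual LLM.

```python
import json

def sql_query(query: str) -> str:
    """Stub for the sql_query tool from the schema above: an in-memory lookup
    stands in for a real database connection."""
    fake_db = {"SELECT count(*) FROM employees": "42"}
    return fake_db.get(query, "no rows")

# Registry mapping tool names to callables.
TOOLS = {"sql_query": sql_query}

def dispatch(tool_call_json: str) -> str:
    """Execute the tool the model asked for and return its result as a string.
    `tool_call_json` mimics the function-call payload an LLM API would emit."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# What a model's function-call request might look like:
model_request = '{"name": "sql_query", "arguments": {"query": "SELECT count(*) FROM employees"}}'
print(dispatch(model_request))  # the agent feeds this result back into the context
```

In a real agent, the dispatch result is appended to the conversation as a tool message, and the LLM is called again to produce the final answer.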
Generally speaking, Re-ranking can be applied after Tool Calls, since Tool Calls can be part of Hybrid Retrieval, with the aggregated data undergoing a further re-ranking step (fig. 5). In this scheme, to pick relevant tools, the input query should be "understood" by the LLM, which requires a separate call to the model. Nevertheless, in simple cases, suitable tools may be chosen based on keywords and if-else statements.
However, there are cases when semantic vector similarity search returns not documents with data but descriptions of tools relevant to the user query, which the agent must then call at the Tool Calls step. We will analyze this scenario in more detail in the next article.
Re-ranking is a modern and very interesting approach for intelligent (LLM-based) filtering and ordering of the results obtained after Hybrid Retrieval. Typically, it is implemented with a specialized cross-encoder model (Sentence Transformers, Hugging Face, Cohere Rerank, etc.), although any general-purpose LLM can serve as such a model (this will be computationally slower than a cross-encoder). It works as follows: the re-ranking model receives the user query and the set of documents from the Hybrid Retrieval step, analyzes the relationships between the query and document tokens, and returns a relevance score ∈ [0, 1]. This allows us (typically via libraries and frameworks) to re-sort the documents obtained from Hybrid Retrieval, select those most relevant to the query, and proceed to the next step of the workflow. It is worth noting that this approach pays off for large amounts of retrieved documents and is not recommended when only the top 5 or top 10 results are retrieved.
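The re-sorting logic can be sketched as follows. The word-overlap scorer is a deliberately crude stand-in for a real cross-encoder, which would score each (query, document) pair jointly with a neural model; the document texts are invented.

```python
def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score in [0, 1]: fraction of query words found in the document.
    A stand-in for a cross-encoder, which scores the (query, doc) pair jointly."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rerank(query, docs, top_k=2):
    """Re-sort retrieved documents by relevance score and keep the top_k."""
    return sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)[:top_k]

docs = [
    "expense reimbursement rules for travel",
    "employee vacation policy and paid leave",
    "office seating chart",
]
print(rerank("vacation policy for employees", docs, top_k=1))
```

With a real cross-encoder, only the scoring function changes; the surrounding sort-and-truncate logic stays the same.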
Context planning is the process of deciding what information from the retrieved documents and data should be included in the prompt sent to the LLM, and how it should be structured. Because the input data undergo final (or iterative stepwise) reasoning by the LLM to answer the user query based on everything obtained in previous steps, we want to make sure the prompt does not include a) irrelevant context or b) more tokens than the LLM's context window can process. LLM-based summarization can help address these issues by reducing the amount of "noisy" information. However, it requires an additional model call, which may become costly when using paid LLM APIs. If the input data is well structured (e.g., JSON), a separate low-complexity relevance classifier can filter out "noisy" chunks of data [https://arxiv.org/abs/2510.04633].
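A simple form of context planning is greedy packing under a token budget, sketched below. Word count stands in for real tokenization (a production system would use the model's own tokenizer), and the scored chunks are invented for illustration.

```python
def pack_context(chunks, max_tokens=50):
    """Greedily include the highest-scoring chunks until the token budget is spent.
    Token count is approximated by word count for illustration."""
    packed, used = [], 0
    for score, text in sorted(chunks, reverse=True):  # highest relevance first
        cost = len(text.split())
        if used + cost <= max_tokens:
            packed.append(text)
            used += cost
    return packed

# (relevance score, chunk text) pairs as they might come from re-ranking.
chunks = [
    (0.9, "Employee X was transferred to department Y in March."),
    (0.7, "Performance reviews before the transfer were average."),
    (0.2, "The cafeteria menu changed in April."),
]
print(pack_context(chunks, max_tokens=16))
```

The packed chunks are then laid out in the prompt, typically with source labels, before the final generation call.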
Solving the problem of large context window sizes is still an area of active research. In such cases, one of the most effective approaches is to use specialized multi-agent systems, where each agent-expert processes a specific domain assigned to it (such as SQL tables or other entities). After all agents-experts complete their tasks, the information is aggregated, and the aggregated data is then passed to the model to generate the final response.
Conclusion
In modern enterprise software, the adoption of LLM technologies is increasingly becoming a de-facto development standard around which system architectures are built. The use of LLMs in knowledge base analysis and question-answering systems requires the development of robust approaches and practices that ensure accurate and reliable results.
The evolution of LLM-based systems, from simple and straightforward RAG 1.0 implementations to RAG 2.0 architectures consisting of multiple patterns and complex, dynamic workflows with multi-step LLM calls, highlights the challenges that must be addressed. As the volume of data processed within context windows grows and the requirements for retriever quality increase, pipelines and workflows must continuously evolve. This progression naturally leads to the adoption of agent-based and multi-agent systems.
In this short article, we examine the first component of such systems, namely, the fundamental steps within the RAG 2.0 paradigm. These steps serve as a foundation for the further development of advanced multi-agent LLM solutions in enterprise environments.
FAQ
How does RAG 2.0 differ from the original RAG?
If the original RAG was a simple "patch" to stop an LLM from hallucinating, RAG 2.0 is a complete architectural overhaul. Early RAG systems simply took a user query, retrieved some documents from a vector database, and stuffed them into a generic LLM prompt.
RAG 2.0 treats the entire system—the retriever, the reranker, and the LLM—as a single, unified model. This deep integration means the system learns how to retrieve information specifically for the task at hand, rather than just doing a generic keyword or semantic search.
For a business, this results in higher accuracy, better handling of complex queries, and a significantly lower rate of hallucinations.
Why do standard LLMs need RAG 2.0 to work with company data?
Standard LLMs are trained on public internet data; they don't know your internal sales reports, HR policies, or proprietary product specs. RAG 2.0 bridges this gap by acting as an "open-book exam" for the AI.
Grounding: Instead of relying on its internal memory, the LLM is forced to answer questions by citing and synthesizing information retrieved from your internal databases (Confluence, SharePoint, SQL, etc.).
Traceability: Because the answer is built from retrieved documents, a RAG 2.0 system can provide citations or links back to the source material. This allows employees to verify the answer and builds trust in the system.
Real-time Updates: If you update a document in your database, the AI’s knowledge updates instantly. You don't need to retrain or fine-tune the massive LLM.
What benefits does RAG 2.0 bring to a business?
The key business benefits include:
Reduced hallucinations: By tightly coupling retrieval with generation, the AI sticks to your provided data. This is critical for industries like finance, legal, and healthcare.
Handling deep knowledge: RAG 2.0 excels at specific, data-intensive queries that standard LLMs struggle with.
Cost efficiency: Optimized systems can use smaller, more efficient LLMs while maintaining high-quality results.
Can RAG 2.0 be used with sensitive or regulated data?
Yes, this is one of its strongest advantages for enterprise use. With a RAG 2.0 system, you can host the entire stack within your own secure environment (VPC) or locally.
The LLM never retains your data. It reads retrieved context, generates a response, and discards it. This allows businesses to work with sensitive data without violating GDPR, HIPAA, or internal policies.
How hard is it to implement a RAG 2.0 system?
While the technology is complex, implementation has become much more accessible. Frameworks like LlamaIndex and LangChain, along with managed cloud services, simplify the process.
A small team of software and data engineers can build a working system. The main effort usually goes into structuring and preparing internal data for effective retrieval.
What comes after RAG 2.0?
We are moving from simple chatbots to compound AI systems. RAG 2.0 enables multiple specialized AI agents to work together, for example:
An agent monitoring supply chain data.
An agent cross-referencing data with forecasts.
An agent drafting operational communications.
This architecture allows AI to act as a coordinated digital workforce rather than just a Q&A tool.