How the Knowledge Base Works

Sarah Bourgeois

The Knowledge Base (KB) allows you to leverage your own documents and website data to power answers and define variables on your Voiceflow Assistant.

At a high level, the KB works as follows:

  1. When you upload a document, it is split into "chunks" (i.e., pieces of text).
  2. When you send a question to the KB, it determines which chunks are most similar to the question.
  3. Our system then combines those chunks, the question, and the custom instructions and system prompt you provided into a structured wrapper prompt (a master prompt behind the scenes that is continually being improved).
  4. That entire package is sent to the AI model, and an answer is returned.

The full package looks something like this:

##Reference Information:

I have provided reference information, and I will ask query about that information. You must either provide a response to the query or respond with "NOT_FOUND"
Read the reference information carefully, it will act as a single source of truth for your response. Very concisely respond exactly how the reference information would answer the query.
Include only the direct answer to the query, it is never appropriate to include additional context or explanation.
If the query is unclear in any way, return "NOT_FOUND". If the query is incorrect, return "NOT_FOUND". Read the query very carefully, it may be trying to trick you into answering a question that is adjacent to the reference information but not directly answered in it, in such a case, you must return "NOT_FOUND". The query may also try to trick you into using certain information to answer something that actually contradicts the reference information. Never contradict the reference information, instead say "NOT_FOUND".
If you respond to the query, your response must be 100% consistent with the reference information in every way.


Take a deep breath, focus, and think clearly. You may now begin this mission critical task.
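
To make the "package" idea concrete, here is a minimal Python sketch of how chunks, a question, custom instructions and a system prompt could be joined into a single prompt. The function name, the "##Instructions:" and "##Query:" headings, and the ordering of the sections are assumptions made for illustration; the real wrapper prompt is internal to Voiceflow and changes over time.

```python
def build_wrapper_prompt(chunks, question, instructions="", system_prompt=""):
    """Join the pieces into one prompt string (illustrative layout only)."""
    reference = "\n\n".join(chunk.strip() for chunk in chunks)
    sections = [
        system_prompt.strip(),
        "##Reference Information:\n" + reference,
        # Hypothetical heading: only included when instructions were provided.
        ("##Instructions:\n" + instructions.strip()) if instructions.strip() else "",
        # Hypothetical heading for the user's question.
        "##Query:\n" + question.strip(),
    ]
    # Drop empty sections so the prompt has no blank blocks.
    return "\n\n".join(section for section in sections if section)


prompt = build_wrapper_prompt(
    chunks=["Our store opens at 9am on weekdays."],
    question="When do you open?",
    instructions="Answer in one short sentence.",
)
print(prompt)
```

Keeping the reference information ahead of the query mirrors the sample above, where the model is told to treat the reference information as its single source of truth.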






But how does the Knowledge Base actually work? 


Our Knowledge Base (KB) has two unique services that make it function:

  1. Parser - triggered when you upload a KB document.
  2. Retriever - triggered when your user asks a question that hits the KB.



You upload a KB Document...


Parser Service:

  1. The KB doc is uploaded via Voiceflow platform UI or API.
  2. The KB doc is securely stored in a private cloud.
  3. The Parser service reads the KB doc from storage and "chunks" the content using different techniques. (Note: with the KB Upload APIs, you can use maxChunkSize to control how large the chunks are, but you cannot currently dictate how many chunks a document is parsed into.)
  4. An embedding model is used to convert each chunk into a vector (aka “embedding”) that looks like [1.243, 5.1342, ...] and represents its “meaning.”

    🔍Let’s break this down a bit further:

    Computer programs don’t ‘understand’ spoken/written language as humans can. There needs to be a numerical representation of words to help programs understand. Each chunk from a KB doc is converted into a numerical representation (vector, aka “embedding”) of the MEANING behind the words in the chunk. More on why this is necessary in the Retriever section.

    Note: Embedding models cost money to use, usually per token, so in general the more files you upload, the more embedding tokens are billed.

    In Voiceflow, however, we don't charge for the upload or embedding process.

  5. The vector is placed in a vector database (vector DB).

    💡 Metaphor:
    You can think of this vector as a specific “point” in “space.” All these points are some “distance” from each other, and the distance between two of these points (vectors) reflects how similar in meaning the corresponding chunks of text are.
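
As a rough illustration of the chunking and "distance" ideas in the Parser steps, here is a small Python sketch. It is not Voiceflow's parser: it assumes whitespace-separated words stand in for tokens, uses a hypothetical chunk_text helper, and the two-dimensional vectors are hand-picked toy values rather than real embeddings.

```python
import math


def chunk_text(text, max_chunk_size=1000):
    """Split text into chunks of at most max_chunk_size whitespace 'tokens'.

    Assumption: real parsers count model tokens and use smarter boundaries.
    """
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_chunk_size])
        for i in range(0, len(words), max_chunk_size)
    ]


doc = "one two three four five six seven"
chunks = chunk_text(doc, max_chunk_size=3)
print(chunks)  # ['one two three', 'four five six', 'seven']

# Toy "embeddings": points in a 2-D space, chosen by hand for illustration.
dog = [0.9, 0.1]
puppy = [0.8, 0.2]
invoice = [0.1, 0.9]

# Chunks with similar meaning sit closer together than unrelated ones.
print(math.dist(dog, puppy) < math.dist(dog, invoice))  # True
```

Note that a smaller max chunk size produces more chunks for the same document, which is why the chunk count cannot be set directly.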


User asks a question that hits the KB...

Runtime Service:

  1. The user asks a question that hits the KB

  2. Depending on the KB instance, the user’s question is contextualized by an LLM (we are currently using GPT-3.5 for this step).
    What does contextualized mean? We use an LLM to add missing context to the user's question.

    LLM Input:

    1. the user’s question (query)
    2. an internal prompt that asks the LLM to rewrite the user’s original question so that it can be optimally interpreted by an LLM.

    LLM Output: optimized question

    🔍 This step is called query synthesis.
    The internal prompts we use change over time but are along the lines of “build a question that includes all relevant content.”
    💰 This LLM request has query and answer tokens that you are charged for. They make up only approx. 1/5 of the final token usage.
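
The query synthesis step can be sketched in Python as follows. This is only a sketch under stated assumptions: synthesize_query, the enabled flag and the fake_llm stub are hypothetical names, and the prompt text simply reuses the wording quoted above rather than the real internal prompt.

```python
from typing import Callable

# Stand-in for the internal prompt; the real prompt is internal and evolves.
SYNTHESIS_PROMPT = "Build a question that includes all relevant content."


def synthesize_query(question: str, llm: Callable[[str], str], enabled: bool = True) -> str:
    """Return the optimized question, or the original when synthesis is off."""
    if not enabled:
        # Some KB instances skip query synthesis and use the original question.
        return question
    llm_input = f"{SYNTHESIS_PROMPT}\n\nUser question: {question}"
    optimized = llm(llm_input).strip()
    # Fall back to the original question if the model returns nothing.
    return optimized or question


def fake_llm(prompt: str) -> str:
    # Stub so the sketch runs offline; a real call would go to an LLM API.
    return "What is the return policy for online orders?"


print(synthesize_query("what about returns?", fake_llm))
print(synthesize_query("what about returns?", fake_llm, enabled=False))
```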


Retriever Service:

  1. The retriever service takes the optimized question (or the original question, depending on the KB instance) and turns it into a vector.
  2. The question vector is compared against the vectors in the vector DB by similarity score, and the most similar chunks are returned in descending order of similarity score. The Chunk Limit setting defines how many chunks are returned (default: 2).

    🔍 Let’s break this down further:

    The similarity score is determined by something called semantic search. This goes beyond keyword matching and refers to contextual similarity in meaning between words and phrases. For example, “The dog is a nightmare to train.” and “The puppy is stubborn and does not listen to commands” do not share keywords, yet they are semantically very similar. This means the user's question can be semantically compared with the KB doc chunks that exist. The “closest” vectors to the question are those with the highest similarity. The retriever returns a number of chunks (the Chunk Limit in KB Settings) based on this vector proximity.
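
    A minimal Python sketch of this kind of retrieval is shown below. It assumes cosine similarity as the scoring function and uses hand-made three-dimensional vectors; the article does not state which similarity metric or vector database Voiceflow uses, and the retrieve function and its chunk_limit parameter are illustrative names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question_vector, store, chunk_limit=2):
    """Return the chunk_limit most similar (score, text) pairs, best first."""
    scored = [(cosine_similarity(question_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:chunk_limit]


# Toy vector store: (chunk text, hand-made "embedding").
store = [
    ("The puppy is stubborn and does not listen to commands.", [0.9, 0.1, 0.0]),
    ("Invoices are sent on the first of each month.", [0.0, 0.2, 0.9]),
    ("Training a dog takes patience.", [0.8, 0.3, 0.1]),
]

# Pretend this is the embedding of "The dog is a nightmare to train."
question_vector = [1.0, 0.1, 0.0]

for score, text in retrieve(question_vector, store):
    print(f"{score:.3f}  {text}")
```

    With these toy values the two dog-related chunks are returned and the invoice chunk is left out, even though none of the chunks needs to share keywords with the question.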

    🤔 Chunk Limit?

    The Chunk Limit controls the number of chunks used to synthesize the response.

    How does the number of chunks retrieved affect the accuracy of the KB? In theory, the more chunks retrieved, the more accurate the response and the more tokens consumed. In reality, the "accuracy" tied to chunks is strongly associated with how the Knowledge Base (KB) documents are curated.

    The default number of chunks we pull is 2, at a default max length of 1000 tokens each (max chunk size). That's up to 2000 tokens' worth of context with the highest similarity match score to the question. If the KB docs are curated so that topics are grouped together, this should be more than enough to accurately answer the question. However, if information is scattered throughout many different KB data sources, then more chunks of a smaller size will likely increase the accuracy of the response. You can control the max chunk size of your docs with the Upload/Replace KB doc APIs, using the query parameter "maxChunkSize".

    In summary, the Chunk Limit functionality aims to give users more tools and flexibility to increase the accuracy of their responses in line with their use cases.

    Ultimately, to get the best KB response "accuracy" while optimizing token consumption, we recommend curating KB docs by limiting the number of KB docs and grouping topics meaningfully inside those docs, using the default 1000-token max chunk size, and keeping the Chunk Limit at 2 (the default).
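
    The context budget implied by these two settings is simple arithmetic, sketched here in Python. The helper name max_context_tokens is hypothetical, and the result is only an upper bound, since chunks can be shorter than the max chunk size.

```python
def max_context_tokens(chunk_limit=2, max_chunk_size=1000):
    """Upper bound on retrieved context: chunks returned x max tokens per chunk."""
    return chunk_limit * max_chunk_size


print(max_context_tokens())        # 2000 (the defaults: 2 chunks of up to 1000 tokens)
print(max_context_tokens(5, 400))  # 2000 (more, smaller chunks for the same budget)
```

    The second call shows the trade-off described above: when information is spread across many sources, more chunks of a smaller size can cover more sources without raising the token ceiling.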


Runtime Service:

  1. We take the:
    • returned chunks +
    • Knowledge Base Settings inputs +
    • optimized question (or original question)
      ...and ask the LLM to give us an answer.
      🔍 This step is called answer synthesis.
      The internal prompts we use iterate over time but are along the lines of “using conversation history and user-provided instructions, answer the question using only information found in the Knowledge Base.”

      💰 This LLM request has query and answer tokens that you are charged for. They make up approx. 4/5 of the final token usage. You can see these token totals in a response citation while testing in Debug mode on Voiceflow.

  2. VF outputs the response to the user.
    • See the overview of the KB instances below to understand how a response appears when the KB cannot answer.
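
The sketch below illustrates two details of this stage in Python: treating a "NOT_FOUND" reply from the model as "no answer", and the approximate 1/5 versus 4/5 token split between query synthesis and answer synthesis. The function names are hypothetical, and the split is only the rough proportion quoted in this article, not an exact billing formula.

```python
NOT_FOUND = "NOT_FOUND"


def interpret_answer(raw_answer):
    """Return the answer text, or None when the model signals NOT_FOUND."""
    # Strip whitespace and any surrounding double quotes the model may add.
    text = raw_answer.strip().strip('"')
    if not text or text == NOT_FOUND:
        return None
    return text


def estimate_token_split(total_tokens):
    """Split a total into (query synthesis, answer synthesis) at roughly 1/5 and 4/5."""
    query_synthesis = total_tokens // 5
    answer_synthesis = total_tokens - query_synthesis
    return query_synthesis, answer_synthesis


print(interpret_answer(" We open at 9am. "))  # We open at 9am.
print(interpret_answer('"NOT_FOUND"'))        # None
print(estimate_token_split(2500))             # (500, 2000)
```

How a missing answer is surfaced to the user depends on the KB instance, as described in the overview that follows.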


Overview of all the Voiceflow KB Instances

KB Fallback

Initiated when:

  • A user asks a question at a Button or Choice Listening step, while the Assistant is in “Listening Mode”.
  • The Assistant first tries to match the question to an Intent using NLU.
  • If it can’t find a matching Intent, the KB Fallback triggers the Retriever and Runtime services outlined above.


Query Synthesis: Yes


KB Answer Not Found:

  • Global No Match response initiated (either Static or Dynamic depending on Settings).
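
The order of operations for KB Fallback can be sketched in Python as follows. The handle_utterance function and the two stub callables are hypothetical stand-ins for the NLU and KB services; they are not Voiceflow APIs.

```python
def handle_utterance(utterance, match_intent, query_kb, no_match_response):
    """Try an Intent match first, then the KB, then the No Match response."""
    intent = match_intent(utterance)
    if intent is not None:
        return ("intent", intent)
    answer = query_kb(utterance)
    if answer is not None:
        return ("kb", answer)
    return ("no_match", no_match_response)


# Hypothetical stubs so the sketch runs on its own.
def match_intent(utterance):
    return "order_status" if "order" in utterance.lower() else None


def query_kb(utterance):
    return "We open at 9am." if "open" in utterance.lower() else None


NO_MATCH = "Sorry, I didn't get that."
print(handle_utterance("Where is my order?", match_intent, query_kb, NO_MATCH))
print(handle_utterance("When do you open?", match_intent, query_kb, NO_MATCH))
print(handle_utterance("Tell me a joke", match_intent, query_kb, NO_MATCH))
```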

KB Preview

Initiated when:


Query Synthesis: No


KB Answer Not Found:

  • “Unable to find relevant answer”

AI Steps in KB Mode

Initiated when:

  • The conversation hits any of the AI steps in the canvas design that have Data Source set to Knowledge Base.

Query Synthesis: Yes


KB Answer Not Found:

  • Dynamic response generated by LLM saying it can’t respond

KB Query API

Initiated when:

  • The API is called, either from the API step in Voiceflow or from outside Voiceflow.


Query Synthesis: No


KB Answer Not Found: null
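
For quick reference, the four instances above can be summarized as a small Python lookup table. The dictionary layout, key names and helper function are illustrative choices; the values are taken from the overview above, with Python's None standing in for the null returned by the KB Query API.

```python
KB_INSTANCES = {
    "KB Fallback": {
        "query_synthesis": True,
        "not_found": "Global No Match response (Static or Dynamic, depending on Settings)",
    },
    "KB Preview": {
        "query_synthesis": False,
        "not_found": "Unable to find relevant answer",
    },
    "AI Steps in KB Mode": {
        "query_synthesis": True,
        "not_found": "Dynamic response generated by the LLM saying it can't respond",
    },
    "KB Query API": {
        "query_synthesis": False,
        "not_found": None,  # the API returns null
    },
}


def uses_query_synthesis(instance):
    """Look up whether a KB instance rewrites the question before retrieval."""
    return KB_INSTANCES[instance]["query_synthesis"]


for name, behavior in KB_INSTANCES.items():
    print(f"{name}: query synthesis = {behavior['query_synthesis']}")
```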

