Architecture Walkthrough
Query flow in a hybrid search assistant
Let’s unpack how a user query flows through this hybrid system step by step.
(1) Structured Filter Logic: The query is first parsed for any structured parameters –
for example, a location, date range, user ID, or product name. These filters are applied to the
underlying data store or index, immediately scoping the search to a relevant
subset. The system supports exact matches and even fuzzy
matching on these fields (so a filter for “Acme Corp” can match “Acme Corporation” if needed).
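To make the filtering step concrete, here is a minimal sketch. The StructuredFilters container, the field names, and the fuzzy-match threshold are all illustrative assumptions; a real deployment would push these predicates down into the data store or index (SQL WHERE clauses, metadata filters on the vector store) rather than loop over records in Python.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class StructuredFilters:
    """Hypothetical structured parameters parsed out of the raw query."""
    location: str | None = None
    company: str | None = None


def fuzzy_match(wanted: str, actual: str, threshold: float = 0.7) -> bool:
    # SequenceMatcher scores "Acme Corp" vs "Acme Corporation" at roughly 0.72,
    # so the two are treated as the same company; the threshold is illustrative.
    return SequenceMatcher(None, wanted.lower(), actual.lower()).ratio() >= threshold


def apply_filters(records: list[dict], f: StructuredFilters) -> list[dict]:
    """Scope the corpus to the records that satisfy the structured filters."""
    kept = []
    for r in records:
        if f.location and r.get("location") != f.location:  # exact match
            continue
        if f.company and not fuzzy_match(f.company, r.get("company", "")):  # fuzzy match
            continue
        kept.append(r)
    return kept
```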
(2) Vector Search: Next, the remaining unstructured query text (e.g. the descriptive
part of the question) is turned into an embedding and run through a vector
search engine across the pre-filtered subset. This finds items that are semantically similar to
the query – using cosine or dot-product similarity – returning a set of results each with a similarity score. At this stage we have, say, the top N candidate results
that both met the structured criteria and are conceptually related to the query.
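The vector step might look roughly like the sketch below. It assumes each pre-filtered candidate already carries a precomputed "embedding" vector and that an embed() helper (a stand-in for whatever embedding model you call) returns the query vector; in production, a vector search engine performs this ranking instead of NumPy.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Stand-in for the embedding model call; returns a 1-D vector."""
    raise NotImplementedError


def top_n_by_similarity(query: str, candidates: list[dict], n: int = 10) -> list[dict]:
    """Rank the pre-filtered candidates by cosine similarity to the query embedding."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = []
    for item in candidates:
        v = np.asarray(item["embedding"], dtype=float)
        sim = float(np.dot(q, v / np.linalg.norm(v)))  # cosine similarity in [-1, 1]
        scored.append({**item, "score": sim})
    scored.sort(key=lambda x: x["score"], reverse=True)
    return scored[:n]
```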
(3) OpenAI Assistant Integration: Now the magic of LLM integration kicks in. Instead
of treating the LLM as a black box, we use OpenAI's function-calling (tools) API to let the
assistant actively participate in retrieval. The OpenAI Assistant is configured with a custom
"search" tool: when it needs information, it generates a function call (with arguments such as a
search query and filters) that our system executes. The vector search results are
then fed back to the assistant (almost like giving it the contents of documents or a summary). Because
this all happens with clean asynchronous code, the LLM isn’t stuck
waiting – the system can fetch multiple results in parallel, handle timeouts, and stream data back
efficiently.
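As a rough illustration of that tool loop, the sketch below uses the OpenAI Python SDK's Chat Completions interface with a tools definition (rather than the full Assistants API) and a hypothetical hybrid_search coroutine wrapping steps (1) and (2). The model name, timeout, and tool schema are placeholders, not the exact configuration described above.

```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

# Tool schema the model can call when it decides it needs retrieval.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Hybrid search over the document corpus.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "filters": {"type": "object"},
            },
            "required": ["query"],
        },
    },
}


async def hybrid_search(query: str, filters: dict | None = None) -> list[dict]:
    """Hypothetical coroutine wrapping steps (1) and (2): filter, then vector search."""
    ...


async def answer(user_question: str) -> str:
    messages = [{"role": "user", "content": user_question}]
    first = await client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=[SEARCH_TOOL]
    )
    msg = first.choices[0].message
    if not msg.tool_calls:
        return msg.content

    messages.append(msg)
    # Execute all requested searches concurrently, with a per-call timeout.
    # We assume "search" is the only tool, so no dispatch by name is shown.
    results = await asyncio.gather(
        *[
            asyncio.wait_for(hybrid_search(**json.loads(c.function.arguments)), timeout=5)
            for c in msg.tool_calls
        ]
    )
    for call, result in zip(msg.tool_calls, results):
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    final = await client.chat.completions.create(model="gpt-4o", messages=messages)
    return final.choices[0].message.content
```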
(4) Result Ranking and Synthesis: Finally, the results can be post-processed and
ranked. Often the vector engine’s similarity scores serve as a base ranking. We can further refine
ordering using business rules or even ask the LLM to re-rank or extract the best
answer. The end result is either a ranked list of relevant results or a composed answer that
cites the top sources. By the time the user gets a response, the system has systematically enforced
filters, leveraged semantic matching, and vetted the output through an LLM – all in a tightly
orchestrated pipeline. This architecture isn’t an academic exercise; it’s built for high performance, observability, and precision at scale.