
Query Routing for LLM Applications

February 3, 2026

Introduction

Many LLM-based applications are not single-purpose. A chatbot might need to fetch live stock prices for one query and look up a weather forecast for another. The job of figuring out which handler to invoke for a given user message is called query routing, and getting it wrong means either failing the user or sending the query to a handler that will return nonsense.

The obvious solution is to let a general-purpose LLM handle routing — describe the available capabilities and ask it which one applies. No custom code required. The problem is latency. A cold LLM call easily takes 30–60 seconds end-to-end, and more complex agentic workflows that write SQL against internal tables can take 5+ minutes. In many situations, this blocks the user from getting the data they need quickly.

The naive fallback — a chain of if/elif blocks checking for keywords — breaks almost immediately in production. Users misspell words, rephrase intent in unexpected ways, and ask semantically equivalent questions using completely different vocabulary. In this article we build a small routing library that handles these cases using two complementary techniques: Jaccard similarity over character n-grams, and sentence-embedding cosine similarity. We then compose them so that the more powerful (but heavier) embedding approach is preferred when available, with the lightweight Jaccard approach as a reliable fallback. The tradeoff is upfront implementation work. Jaccard routing itself takes < 1 ms, and the full round-trip — routing, calling the right downstream API, and returning a response — comes back within a few hundred milliseconds. You can push this lower still by bundling the route signal and the input data into a single message rather than making two sequential calls. Either way, this is far faster than the general-purpose LLM/RAG systems described above.

The Interface

We start by defining the contract. Route is an enum of the available destinations. RouterPort is an abstract base class that carries the shared route description dictionary and declares a single method, find_route, that all implementations must satisfy. The dictionary maps natural-language descriptions of each intent to its Route value; these descriptions are what the router will compare incoming queries against.

from abc import ABC, abstractmethod
from enum import Enum

class Route(Enum):
    STOCK_PRICES = "STOCK_PRICES"
    WEATHER = "WEATHER"

class RouterPort(ABC):
    ROUTER_OPTIONS: dict[str, Route] = {
        "stocks": Route.STOCK_PRICES,
        "find me stock prices": Route.STOCK_PRICES,
        "stock market": Route.STOCK_PRICES,
        # ... more stock phrases
        "weather": Route.WEATHER,
        "temperature": Route.WEATHER,
        "what is the weather?": Route.WEATHER,
        "what is the temperature?": Route.WEATHER,
        # ... more weather phrases
    }

    @abstractmethod
    def find_route(self, query: str) -> Route:
        pass

The descriptions in ROUTER_OPTIONS serve as a training set of sorts — they encode the vocabulary and phrasing the router should recognize for each intent. Adding a new route is simply a matter of adding a new Route value and populating the dictionary with representative phrases.
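
For instance, adding a hypothetical NEWS route (the NEWS value and its phrases are illustrative, not part of the example above) would look like this:

class Route(Enum):
    STOCK_PRICES = "STOCK_PRICES"
    WEATHER = "WEATHER"
    NEWS = "NEWS"  # new destination

# ...and the corresponding entries in RouterPort.ROUTER_OPTIONS:
#     "news": Route.NEWS,
#     "latest headlines": Route.NEWS,
#     "what is happening in the world?": Route.NEWS,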

Jaccard Similarity with Character N-grams

Jaccard similarity between two sets A and B is defined as

J(A, B) = |A ∩ B| / |A ∪ B|

and ranges from 0 (disjoint) to 1 (identical). If we tokenize queries into words and compare word sets, we get a reasonable similarity measure — but it is brittle. A user who types “wheather” instead of “weather” produces a word token that shares no overlap with any route description.

Character n-grams solve this. We slide a window of length n across the full string — spaces included — producing overlapping character substrings. A single typo corrupts at most n consecutive n-grams while leaving the rest intact. The query “wheather” with n = 4 produces {"whea", "heat", "eath", "athe", "ther"}, which overlaps substantially with the n-grams of “weather”. Including spaces means n-grams also capture context at word boundaries: the substring "ck p" (a consonant cluster, a space, and the next word's first letter) is shared between two strings that differ only in vowels around that cluster. It also handles a typo class that per-word n-grams cannot: space insertions and deletions. A query like “stockprices” or “sto ck” splits into the wrong word tokens entirely, but its full-string trigrams still overlap substantially with “stock prices.” We use a mix of n-gram sizes (4, 5, and 6) unioned together, along with the word-level tokens, so the scorer benefits from both word-level and character-level signal.

from typing import Optional, Tuple

class JaccardRouter:
    @staticmethod
    def jaccard_similarity(set1: set[str], set2: set[str]) -> float:
        intersection = set1 & set2
        union = set1 | set2
        if not union:
            return 0.0
        return len(intersection) / len(union)

    @staticmethod
    def tokenize_string(s: str) -> set[str]:
        return set(s.split())

    @staticmethod
    def compute_ngrams(s: str, n: int) -> set[str]:
        ngrams = set()
        if len(s) < n:
            return ngrams
        for i in range(len(s) - (n - 1)):
            ngram = s[i : i + n]
            ngrams.add(ngram)
        return ngrams

    @staticmethod
    def compute_all_ngrams(s: str, n: list[int]) -> set[str]:
        result = JaccardRouter.tokenize_string(s)
        for k in n:
            result |= JaccardRouter.compute_ngrams(s, k)
        return result

    @staticmethod
    def get_best_candidate(request: str, candidate_routes: set[str], n: list[int]
                           ) -> Optional[Tuple[str, float]]:
        # Compute the query's token set once, not once per candidate.
        request_tokens = JaccardRouter.compute_all_ngrams(request, n)
        best_desc, best_score = "", 0.0
        for desc in candidate_routes:
            jac = JaccardRouter.jaccard_similarity(
                request_tokens, JaccardRouter.compute_all_ngrams(desc, n)
            )
            if jac > best_score:
                best_desc, best_score = desc, jac
        if not best_desc:
            return None
        return best_desc, best_score
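
To make the typo robustness concrete, here is a quick check of the “wheather”/“weather” example using 4-grams alone (values computed by hand from the definitions above):

q = JaccardRouter.compute_ngrams("wheather", 4)  # {'whea', 'heat', 'eath', 'athe', 'ther'}
d = JaccardRouter.compute_ngrams("weather", 4)   # {'weat', 'eath', 'athe', 'ther'}
print(JaccardRouter.jaccard_similarity(q, d))    # 3 shared / 6 in the union = 0.5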

get_best_candidate scores the incoming query against every description in ROUTER_OPTIONS and returns the best match. The adapter below wires this into the port interface, resolving the winning description back to its Route value.

class JaccardRouterAdapter(RouterPort):
    def find_route(self, query: str) -> Route:
        candidate_route_strings = set(RouterPort.ROUTER_OPTIONS.keys())
        result = JaccardRouter.get_best_candidate(
            query, candidate_route_strings, n=[4, 5, 6]
        )
        # get_best_candidate returns None when nothing scores above zero.
        if result is None:
            raise MissingRouteException()
        best_candidate, _ = result
        return RouterPort.ROUTER_OPTIONS[best_candidate]

To see why character n-grams matter, consider the query “find me stck prces” compared against two route descriptions: “find me stock prices” and “find me weather.”

Word-level Jaccard tokenizes the query as {find, me, stck, prces}. The intersection with “find me weather” is {find, me}, giving J = 2/5 = 0.40. The intersection with “find me stock prices” is also {find, me}, giving J = 2/6 ≈ 0.33. Word Jaccard picks the wrong route.

Full-string trigrams recover the correct answer. The 16 trigrams of “find me stck prces” include the shared prefix substrings "fin", "ind", "nd ", "d m", " me", "me ", "e s", " st", plus the cross-word substrings "ck ", "k p", " pr" that appear in both “stck prces” and “stock prices” because the consonant cluster and the following space are identical despite the different vowels, plus the shared tail "ces". Together with the word tokens, 14 of the 28 union tokens are shared with “find me stock prices”, giving J = 14/28 = 0.50. Only 8 of 28 are shared with “find me weather”, whose trigrams match only through the prefix "me ", giving J ≈ 0.29. The trigram scorer picks the right route by a clear margin.
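
You can reproduce the worked example with the scorer directly, restricted to trigrams (n=[3]) to match the arithmetic above:

query = "find me stck prces"
for desc in ("find me stock prices", "find me weather"):
    score = JaccardRouter.jaccard_similarity(
        JaccardRouter.compute_all_ngrams(query, [3]),
        JaccardRouter.compute_all_ngrams(desc, [3]),
    )
    print(desc, round(score, 2))
# find me stock prices 0.5
# find me weather 0.29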

Embedding-Based Routing

Jaccard over n-grams handles typos well but can struggle with paraphrases that share little surface-level text. “What’s the temperature outside?” and “Is it going to rain?” are semantically related to weather but share almost no n-grams with the description “weather forecast.” Sentence embeddings address this by mapping strings into a high-dimensional vector space where semantically similar strings land near each other regardless of surface form.

To compare two vectors we use cosine similarity rather than Euclidean distance. Distance becomes unreliable in high-dimensional spaces — as dimensionality grows, distances between points concentrate and lose discriminative power (the curse of dimensionality). Cosine similarity sidesteps this by measuring the angle between vectors instead of how far apart they are, which stays meaningful regardless of dimension.

The diagram below shows two vectors in 2D space and the angle θ between them. The plot shows how cos(θ) behaves over [0, π/2]: an angle of zero gives a similarity of 1 (identical direction), and the score decreases toward 0 as the vectors diverge. In practice, embedding models are trained specifically to produce this geometry — for example via contrastive learning, where the model is given pairs of similar and dissimilar strings and learns to place them close together or far apart in the vector space accordingly.

[Figure: two vectors in 2D with the angle θ between them, and a plot of cos(θ) from 0 to π/2.]
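
As a minimal numeric sanity check of the formula (vectors chosen arbitrarily for illustration):

import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])  # 45° from a
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # ≈ 0.707, i.e. cos(π/4)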

At construction time, we encode every route description once and cache the resulting vectors. At query time, we encode the incoming query, compute cosine similarity against each cached vector, and return the route corresponding to the highest score.

import numpy as np
from sentence_transformers import SentenceTransformer

class EmbeddingRouterAdapter(RouterPort):
    @staticmethod
    def similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity: dot product normalized by vector magnitudes.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def __init__(self) -> None:
        self._route_keys = list(RouterPort.ROUTER_OPTIONS.keys())
        model_name = "sentence-transformers/all-MiniLM-L6-v2"
        self._model = SentenceTransformer(model_name)
        # Encode every route description once and cache the vectors.
        self._route_embeddings = self._model.encode(self._route_keys)

    def find_route(self, query: str) -> Route:
        query_embedding = self._model.encode([query])[0]
        similarities = [
            EmbeddingRouterAdapter.similarity(query_embedding, emb)
            for emb in self._route_embeddings
        ]
        best_idx = int(np.argmax(similarities))
        best_route = self._route_keys[best_idx]
        return RouterPort.ROUTER_OPTIONS[best_route]
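
A quick usage sketch (the query is illustrative, and the exact winner depends on the model and the description set):

router = EmbeddingRouterAdapter()
print(router.find_route("is it going to rain tomorrow?"))
# Expected: Route.WEATHER — the query shares almost no n-grams with the
# weather descriptions, but its embedding lands near them.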

Loading a sentence transformer model takes a few seconds. This cost is paid once at startup and is negligible afterward — query-time encoding of a short string is fast. However, in some environments the model may not be available: the host may lack the memory, the dependency may not be installed, or the model download may time out. This motivates a fallback strategy.

LLM-Based Routing

A third option requires no local ML model at all: just ask an LLM. The system prompt lists the available routes; the model reads the user’s query and replies with exactly one route name. This handles completely novel phrasing and implicit intent better than either of the above approaches, and the only dependency is an API client. The tradeoff is latency — each routing decision requires a network round-trip — and per-call cost, which makes it unsuitable at high QPS but perfectly reasonable for moderate traffic.

import anthropic

class LLMRouterAdapter(RouterPort):
    _SYSTEM_PROMPT = (
        "You are a query router. Given a user message, reply with exactly one of "
        "the following route names and nothing else: "
        + ", ".join(r.value for r in Route)
    )

    def __init__(self, client: anthropic.Anthropic) -> None:
        self._client = client

    def find_route(self, query: str) -> Route:
        response = self._client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=20,
            system=self._SYSTEM_PROMPT,
            messages=[{"role": "user", "content": query}],
        )
        route_name = response.content[0].text.strip()
        return Route(route_name)

Using a small, fast model like Haiku keeps latency and cost low while still benefiting from full language understanding. The max_tokens=20 cap keeps the response to the route name alone and prevents the model from elaborating.
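
Wiring it up takes a few lines (assumes an ANTHROPIC_API_KEY in the environment; the query and expected route are illustrative):

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
router = LLMRouterAdapter(client)
print(router.find_route("how hot will it get this afternoon?"))  # expected: Route.WEATHER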

Composition and Fallback

CompositionRouterAdapter attempts to initialize the embedding adapter with a timeout. If it succeeds, all subsequent queries are routed through embeddings. If it fails for any reason, it silently falls back to the Jaccard adapter. Either way, the caller gets the same RouterPort interface and is unaware of which implementation is running.

from concurrent.futures import Future, ThreadPoolExecutor

class CompositionRouterAdapter(RouterPort):
    def __init__(self) -> None:
        executor = ThreadPoolExecutor(max_workers=1)
        future: Future[EmbeddingRouterAdapter] = executor.submit(EmbeddingRouterAdapter)
        try:
            self._adapter: RouterPort = future.result(timeout=60)
        except Exception:
            self._adapter = JaccardRouterAdapter()
        finally:
            executor.shutdown(wait=False)

    def find_route(self, query: str) -> Route:
        return self._adapter.find_route(query)
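
Callers construct it like any other adapter and never need to know which backend won (the query and expected route are illustrative):

router = CompositionRouterAdapter()
print(router.find_route("find me stck prces"))
# Expected: Route.STOCK_PRICES, whether the embedding model loaded
# or the Jaccard fallback kicked in.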

Tradeoffs

Jaccard over n-grams is fast, requires no ML dependencies, and is robust to typos. Its weakness is vocabulary: it can only match what it can see at the surface level, and novel phrasings with no textual overlap will score poorly. It is also sensitive to the quality of the descriptions in ROUTER_OPTIONS — a sparse or poorly worded description set will produce unreliable results.

Embedding-based routing handles paraphrase and semantic drift much better, at the cost of a model loading step and a heavier dependency. For most production deployments where the environment is controlled, this is the right default. The composition pattern lets you ship the embedding router with confidence while keeping the Jaccard adapter as insurance against unexpected environments or model failures.

LLM-based routing offers the strongest language understanding with no local ML dependencies, at the cost of per-call latency and API spend. It works well when routing decisions are infrequent or when query complexity genuinely requires language model reasoning to resolve ambiguity. It can also slot into the composition chain: try embeddings first, fall back to the LLM for low-confidence cases, and use Jaccard as the last resort.
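
A sketch of that three-tier chain, with the assumptions labeled: the 0.5 threshold is arbitrary, and find_route_with_score is a hypothetical extension of the embedding adapter that would expose its best cosine score alongside the route.

class TieredRouterAdapter(RouterPort):
    CONFIDENCE_THRESHOLD = 0.5  # arbitrary; tune on real traffic

    def __init__(self, embedding: EmbeddingRouterAdapter,
                 llm: LLMRouterAdapter, jaccard: JaccardRouterAdapter) -> None:
        self._embedding = embedding
        self._llm = llm
        self._jaccard = jaccard

    def find_route(self, query: str) -> Route:
        # Hypothetical helper: returns (Route, best cosine similarity).
        route, score = self._embedding.find_route_with_score(query)
        if score >= self.CONFIDENCE_THRESHOLD:
            return route
        try:
            return self._llm.find_route(query)  # network call; may fail
        except Exception:
            return self._jaccard.find_route(query)  # always available locally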

A few other considerations: