This blog discusses Rovo search, the search experiences it supports, and how they work. We focus on the foundational search relevance stack, smart answers, and the various personalization techniques employed. While we briefly mention how content is indexed, we primarily cover how relevant search results are returned once the content is already in the index.

Rovo search is a unified solution that assists customers in locating content across the diverse platforms utilized by their teams. The goal of Rovo is to enable customers to search not only across Atlassian products such as Confluence, Jira, and Bitbucket but also across third-party SaaS applications like Google Drive, Slack, SharePoint, and others that are heavily used for data sharing and management. Currently, we support enterprise search across 50+ different SaaS applications for our customers. The mechanism that connects a SaaS application to Rovo search is referred to as a connector. To learn more about the connectors available in Rovo, please visit Available Rovo Connectors | Atlassian.

Since search powers context fetching for every Rovo interface – search, chat, content creation agents, etc. – it is one of the most critical pieces of Rovo infrastructure. This is why search infrastructure and search relevance are considered foundational building blocks of Rovo.

Terminology

Before we proceed with describing how all the magic happens, let us clarify some concepts we will be using throughout this blog.

Connectors are integrations that link external SaaS applications and data sources with Rovo. For example, the Google Drive connector, the SharePoint connector, and so on.

OpenSearch is the platform we use to index content and provide a way to search; Atlassian currently uses AWS OpenSearch. Every piece of content is stored as a “document” in OpenSearch, which includes the text content and other metadata. Any of these attributes can be configured to be searchable.

BM25 (Best Match 25) is a ranking function used by search engines to estimate the relevance of documents to a search query.
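
To make that concrete, here is a minimal, illustrative Python sketch of the classic BM25 formula for a single query term; it is simplified relative to what OpenSearch actually runs (no field boosts, analyzers, or per-field tuning).

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Illustrative BM25 score of one query term for one document.

    tf          -- term frequency of the term in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the corpus
    n_docs      -- total number of documents in the corpus
    doc_freq    -- number of documents containing the term
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A document's BM25 score for a query is the sum of its per-term scores.
```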

KNN refers to the K-Nearest Neighbors algorithm, which retrieves the documents closest to a query based on semantic similarity. This is what we use to support semantic search.
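
As an illustration of the idea (not our production code), semantic search embeds the query and the documents into vectors and returns the k nearest documents under a similarity measure such as cosine similarity:

```python
import numpy as np

def knn_search(query_vec, doc_vecs, k=3):
    """Return the indices of the k documents most similar to the query vector.

    query_vec -- 1-D embedding of the query
    doc_vecs  -- 2-D array with one embedding per document
    """
    # Cosine similarity between the query and every document embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:k]
```

In practice, OpenSearch's k-NN index uses approximate nearest-neighbor data structures rather than this brute-force scan, which is what makes semantic retrieval fast at scale.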

LLM refers to Large Language Model. We use multiple large language models at Atlassian.

Search experiences

There are multiple search experiences available in Rovo, as described in the following sections.

Quick find

Quick find is the search box at the top of Atlassian products. It is available from several entry points, for example Home, Confluence, and Jira.

This experience is about very quickly finding (or re-finding) content you engaged with recently.

Full search (advanced with filter support)

If you press Enter in the quick find experience, you land in the advanced full search experience, which supports various filters such as product, type, contributor, and time.

This performs a deeper search that returns results that are relevant, authoritative, popular, and from products you have an affinity for. Unlike quick find, it does not fetch content based only on your recent engagements; those engagements, however, help keep the most relevant search results at the top of the list.

Smart Answers with citations

Smart Answers in search are designed to provide users with quick, AI-generated answers to their questions, grounded in authoritative content from across connected knowledge bases. To ensure credibility and user trust, Smart Answers include citations that reference the exact sources used to generate each answer.

Smart Answers can also display concise knowledge cards for certain intent categories.

Rovo Chat

Search is one of the core tools powering Rovo Chat, allowing the AI Assistant to generate relevant results to address the user’s query. The search capabilities of Rovo Chat are powered by the same underlying infrastructure as the Smart Answers experience.

How we search across 50+ different SaaS applications

We support various types of integrations for content platforms.

For many applications, we ingest most or all of the content into our search index and are therefore able to provide a rich and relevant search experience across that content. Examples include Google Drive and Slack.

For some applications, we ingest content that is linked in our first-party products such as Confluence and Jira. This ensures that recently linked third-party content is searchable in Rovo for users who can view the first-party content. Figma is one example.

For a small number of applications, we use a federated approach and integrate with the third-party search API so that Rovo remains a one-stop search solution. Examples include Gmail and Outlook Mail.

We also listen to user activity signals for various fully ingested content platforms such as Google Drive and SharePoint, so we can use them to better rank search results.

We have multiple search stacks to help us scale to so many different products: a document search stack that focuses on knowledge-base-style content, a messaging search stack designed for message-style content, and a default stack that supports other content types such as designs, videos, tables, etc.

How search works

The high-level flow of a search request looks like this:

There are two main components in the overall flow.

The foundational search flow is where all content search converges. Our search stack has the following layers:

  1. Query intelligence – This is the first layer in our search stack; it understands the intent of the query, for example whether it is a natural language query and which content type or product the user might be referring to. The query rewriting logic rephrases the original query, fixes spelling and typos as needed, and handles acronyms. We use a pre-trained language model to perform classification and a self-hosted LLM to reformulate the query.
  2. OpenSearch retriever L1 – This is the OpenSearch query with various matching clauses, filters, and a permission-check clause that matches against the permissions indexed. This step mainly uses BM25 and KNN techniques to match the text query against content in the index. We also have a rescoring block that uses a model to rerank retrieved documents based on index-level features such as popularity and contributors; we use a fine-tuned Gradient Boosted Decision Tree model for reranking, hosted in OpenSearch itself. (A simplified query sketch follows this list.)
  3. Semantic & Behavioral ranker L2 – This is where we rerank the results from OpenSearch even further based on richer, deeper query-to-document semantic relevance and behavioral signals. We fetch these signals from our feature store and invoke our reranking model hosted in AWS SageMaker. We have an in-house fine-tuned cross-encoder model for semantic relevance as well as a fine-tuned DCN (Deep & Cross Network) that uses behavioral signals. A DCN is a type of neural network architecture designed to efficiently learn feature interactions, which is especially important in fields like recommender systems and click-through rate (CTR) prediction. The L2 ranker was trained in a multi-task fashion, and at inference time it outputs multiple scores, including a semantic relevance score, pCTR, etc. A shallow layer built on top of it consumes these scores and generates a final ranking score; this shallow layer was optimized toward search success with relevance as a guardrail. (A sketch of this score combination also follows this list.)
  4. Interleaver L3 – This is where we take the post-L2 results from various products and interleave them into a final result list. The interleaving is based on product affinity and the semantic relevance between the query and the text content.
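
To make the L1 retrieval step more concrete, here is an illustrative sketch of the shape such a query can take: a BM25 match clause and a KNN clause combined in a bool query, a permission filter against indexed principals, and a rescore block over the top window. The index name, field names, and permission model are hypothetical, and a simple function_score stands in for the GBDT learning-to-rank rescorer that actually runs in OpenSearch.

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query_text = "search relevance deep dive"
# In production the query vector comes from an embedding model; a dummy
# 768-dimensional vector keeps this sketch self-contained.
query_vector = [0.0] * 768

body = {
    "size": 50,
    "query": {
        "bool": {
            "should": [
                # Lexical match (BM25) against title and body.
                {"multi_match": {"query": query_text, "fields": ["title^2", "body"]}},
                # Semantic match (KNN) against the document embedding.
                {"knn": {"embedding": {"vector": query_vector, "k": 50}}},
            ],
            # Only return documents the user is allowed to see.
            "filter": [{"terms": {"allowed_principals": ["user:123", "group:eng"]}}],
        }
    },
    # Rescore the top window with index-level signals; in our stack this is
    # where the GBDT learning-to-rank model is applied.
    "rescore": {
        "window_size": 50,
        "query": {
            "rescore_query": {
                "function_score": {
                    "field_value_factor": {"field": "popularity", "modifier": "log1p"}
                }
            }
        },
    },
}

results = client.search(index="content", body=body)
```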
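
As a simplified picture of the L2 stage, each candidate gets several scores (semantic relevance from the cross-encoder, pCTR and other behavioral predictions from the DCN), and the shallow layer combines them into one ranking score. The feature names and weights below are invented purely for illustration; the real shallow layer is learned, not hand-weighted.

```python
from dataclasses import dataclass

@dataclass
class L2Scores:
    semantic_relevance: float  # cross-encoder query-to-document score
    p_ctr: float               # predicted click-through rate from the DCN
    p_long_dwell: float        # another behavioral prediction (illustrative)

def final_ranking_score(s: L2Scores) -> float:
    """Toy 'shallow layer': a weighted blend of the multi-task L2 outputs."""
    return 0.6 * s.semantic_relevance + 0.3 * s.p_ctr + 0.1 * s.p_long_dwell

candidates = {
    "doc-a": L2Scores(semantic_relevance=0.82, p_ctr=0.40, p_long_dwell=0.35),
    "doc-b": L2Scores(semantic_relevance=0.75, p_ctr=0.65, p_long_dwell=0.50),
}
ranked = sorted(candidates, key=lambda d: final_ranking_score(candidates[d]), reverse=True)
print(ranked)  # ['doc-b', 'doc-a']: behavioral signals lift doc-b above doc-a here
```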

When a user fires a search request, the front end invokes a GraphQL search endpoint. The endpoint lives in a service called aggregator, which is responsible for query intelligence and for fanning out search queries across products. We maintain multiple indices to support all the products, and we parallelize across products to reduce latency and to keep the index mappings easy to manage. The Query Intelligence component rephrases/rewrites the query and sends it further along. The fanout component sends each product-specific search request to another service called searcher, which communicates directly with AWS OpenSearch for content retrieval.

The searcher service has two layers: content retrieval (L1), which honors the user’s permissions, and reranking (L2), as described above. Post-reranking, we perform an additional permission check with the first-party or third-party product; this ensures the highest level of content security and only shows users the content they have access to. Once we fetch results for each product, we apply yet another ranker (L3) to interleave the results from the various products. The interleaving is decided based on product affinity and the intent for a specific type of result. For example, the query “search improvements roadmap page” will prioritize Confluence pages, whereas “SEARCH-4356” will ensure Jira tickets are prioritized in the final result set.
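
As a rough sketch of that interleaving decision (with invented affinity numbers and deliberately naive intent rules), the L3 step can weight each product's results by the user's affinity for that product and by the detected intent of the query:

```python
import re

def product_boost(query: str, product: str, affinity: dict) -> float:
    """Toy L3 boost combining product affinity with simple query-intent rules.

    affinity -- e.g. {"confluence": 0.7, "jira": 0.3}, derived from usage
    """
    boost = affinity.get(product, 0.1)
    # An issue-key-looking query ("SEARCH-4356") strongly favors Jira.
    if product == "jira" and re.match(r"^[A-Z][A-Z0-9]+-\d+$", query.strip()):
        boost += 1.0
    # Queries mentioning "page" lean toward Confluence.
    if product == "confluence" and "page" in query.lower():
        boost += 0.5
    return boost

def interleave(results_by_product: dict, query: str, affinity: dict) -> list:
    """Merge per-product (doc, score) lists into one list ordered by boosted score."""
    merged = [
        (score * product_boost(query, product, affinity), doc)
        for product, results in results_by_product.items()
        for doc, score in results
    ]
    return [doc for _, doc in sorted(merged, key=lambda t: t[0], reverse=True)]
```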

Smart Answers


When a user submits a query, it is first classified into one of several intents:

  1. Person → Resolves the person entity and displays smart card
  2. Team → Resolves the Team entity and displays smart card
  3. Bookmarks → Returns a direct link that has been “bookmarked” by your organization admin
  4. Natural Language → Invokes the Smart Answer workflow
  5. None → No smart answer is displayed and a standard search results page is shown

The first three query categories lead to a near-instantaneous answer that aids faster navigational experiences. The natural language query category requires a full invocation of the Smart Answers workflow.

Smart Answer workflow

When a query is identified as a natural language query, it is routed to the search tool. The search tool first rewrites the query using Atlassian’s self-hosted query rewrite model. This step optimizes the original user input for search relevance.

When preparing a query for rewriting, we enrich the context provided to the query rewriter with user-specific information such as organization, location, and current time. This enables the rewriter to augment queries with relevant details (for example, transforming “when is the next holiday” into “when is the next holiday in the US”). We override the standard query intelligence layer here so that we can pass this additional context into query rewriting.
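
To give a feel for what that context enrichment looks like, here is a toy sketch. The prompt wording and the helper function are hypothetical; the real system uses Atlassian's self-hosted rewrite model rather than a generic prompt template.

```python
from datetime import datetime, timezone

def build_rewrite_prompt(query: str, org: str, location: str) -> str:
    """Assemble a rewrite prompt enriched with user-specific context (illustrative)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return (
        "Rewrite the search query so it is explicit and self-contained.\n"
        f"User organization: {org}\n"
        f"User location: {location}\n"
        f"Current time: {now}\n"
        f"Query: {query}\n"
        "Rewritten query:"
    )

# The prompt is sent to the rewrite model, which might turn
# "when is the next holiday" into "when is the next holiday in the US".
print(build_rewrite_prompt("when is the next holiday", "Acme Corp", "US"))
```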

Once the query has been rewritten with the enhanced context, it is sent to our GraphQL search service, which will perform a search using the foundational search stack.

Once documents from many diverse data sources are returned, we have a transformation pipeline to convert all diverse document types into a unified data model. This allows us to handle 50+ connectors which can have varying forms of data such as documents, video transcripts, Slack messages, etc.

Now that we have assembled all of our documents into our unified data model, we chunk them into manageable passage snippets. After chunking, passages are ranked using cross-encoder models (e.g., ms-marco-MiniLM). This step is critical for selecting the most relevant chunks for answer generation.
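
With the open-source sentence-transformers library, passage ranking with an MS MARCO MiniLM cross-encoder looks roughly like the sketch below; the query and passages are placeholders, and the exact model we run in production may differ.

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

# A publicly available MS MARCO cross-encoder in the MiniLM family.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I rotate my API token?"
passages = [
    "To rotate an API token, go to Security settings and click Regenerate token.",
    "Our quarterly roadmap covers search, chat, and agent improvements.",
    "API tokens expire after 90 days and can be rotated at any time.",
]

# A cross-encoder scores each (query, passage) pair jointly, which is more
# accurate than comparing independently computed embeddings.
scores = model.predict([(query, p) for p in passages])

# Keep the highest-scoring chunks for answer generation.
ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
for passage, score in ranked[:2]:
    print(f"{score:.3f}  {passage}")
```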

Citations

When we chunk our documents, we also add page-level metadata such as the URL and the document title into each passage. When generating the answer, this offers a signal to “ground” the final answer on each passage source, allowing the LLM-generated answer to directly cite the document that was used to generate the answer. This is a passage-level citation, meaning that LLM-generated answers can display citations for individual claims made in the final smart answers response.
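
Conceptually, each chunk carries its source metadata alongside its text, so the answer-generation prompt can number the passages and the model can cite them per claim. The stripped-down sketch below is illustrative, not our production prompt template.

```python
chunks = [
    {"text": "Rovo supports enterprise search across 50+ SaaS applications.",
     "title": "Rovo connectors", "url": "https://example.com/connectors"},
    {"text": "Smart Answers include citations to the sources used.",
     "title": "Smart Answers", "url": "https://example.com/smart-answers"},
]

def build_grounded_prompt(question: str, chunks: list) -> str:
    """Number each passage and ask the model to cite passage numbers per claim."""
    sources = "\n".join(
        f"[{i + 1}] {c['title']} ({c['url']}): {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the passages below, and cite each claim "
        f"with its passage number, e.g. [1].\n\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

print(build_grounded_prompt("How many connectors does Rovo support?", chunks))
```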

Relevance

Search results are considered relevant when meaningful results show up and they are ranked effectively relative to one another. For example, if there is a recent Google Drive presentation that is a deep dive on search relevance and a user searches for “Search relevance deep dive”, the presentation should show up in the top results. If the end user is the presenter or part of the team that presented the slides, they should see it ranked in the top three results. There may be other content that also describes search relevance, but based on recency and the user’s affinity, the most relevant content should show up at the top of the search results.

Index

We store the content, contributors, and metadata associated with the content in our content search index. In the case of messaging platforms, we maintain an index where we group related messages for richer, more meaningful context. We also enrich the index with popularity and authority signals such as number of contributors, engagement signals, type of container/folder (personal or public), so that we can influence recall with signals other than semantic or lexical matching. We also store user IDs, mentions and related metadata in the index, so that we can hydrate them in the UI while displaying search results.
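
As a simplified illustration (field names and dimensions here are invented, not our actual mapping), an OpenSearch index mixing searchable text, metadata, engagement signals, permissions, and an embedding field could be defined like this:

```python
# Illustrative OpenSearch index definition; real mappings differ per content type.
content_index = {
    "settings": {"index": {"knn": True}},  # enable the k-NN vector index
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "body": {"type": "text"},
            "contributors": {"type": "keyword"},
            "mentions": {"type": "keyword"},
            "container_type": {"type": "keyword"},     # e.g. personal vs. public space
            "view_count": {"type": "long"},            # popularity signal
            "comment_count": {"type": "long"},         # engagement signal
            "updated_at": {"type": "date"},            # freshness signal
            "allowed_principals": {"type": "keyword"}, # permissions used for filtering
            "embedding": {"type": "knn_vector", "dimension": 768},
        }
    },
}

# client.indices.create(index="content", body=content_index)
```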

Personalization

Rovo search results are personalized. We use various methods of personalization as follows:

  1. If a user has created the content themselves and searches for keywords in that content, it will rank higher.
  2. If the content was created by the user’s collaborators, it will rank higher.
  3. For messaging applications like Slack, Teams, etc., we show search results from the top channels that the user has been active in recently.
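
A toy version of those boosts (the signal names and weights are made up for illustration) might look like:

```python
def personalization_boost(doc: dict, user_id: str, collaborators: set, active_channels: set) -> float:
    """Illustrative additive boost from personalization signals."""
    boost = 0.0
    if user_id in doc.get("contributors", []):
        boost += 0.5   # the user created or edited this content
    elif set(doc.get("contributors", [])) & collaborators:
        boost += 0.25  # created by people the user collaborates with
    if doc.get("channel") in active_channels:
        boost += 0.25  # a message from a channel the user is active in
    return boost
```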

We use various features such as container type, content length, number of contributors, freshness and activity to determine the authority of content.

The popularity of content is calculated for the document using views, likes, comments, and any other available engagement signals.

Freshness is determined by the last updated time and is treated differently across products, on a scale of months, weeks, or days. We boost fresher results toward the top and penalize older results.
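
One common way to model such a boost (the half-lives below are purely illustrative) is an exponential decay on the document's age, with a half-life tuned per product: days for messages, weeks or months for documents.

```python
import math
from datetime import datetime, timezone

# Illustrative half-lives per product; real values are tuned per content type.
HALF_LIFE_DAYS = {"slack": 7, "confluence": 90, "drive": 60}

def freshness_boost(updated_at: datetime, product: str) -> float:
    """Decays from 1.0 (just updated) toward 0.0 as content ages."""
    age_days = (datetime.now(timezone.utc) - updated_at).total_seconds() / 86400
    half_life = HALF_LIFE_DAYS.get(product, 30)
    return math.exp(-math.log(2) * age_days / half_life)
```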

How we evaluate search relevance

When you search in Rovo, you want the most relevant answers in sub-second latency. But how do we know if our search is actually working well? At Atlassian, we use a blend of user behavior signals, explicit feedback, and language model judgment to keep Rovo’s search results accurate and helpful.

We perform an online evaluation by combining real user behavior with explicit feedback and AI judgment. This ensures search results are not just fast, but truly relevant. This approach helps us build trust, improve productivity, and keep knowledge accessible to everyone.

Query Success Rate (QSR): Our north-star metric. QSR tracks whether users found what they needed, whether by clicking a result, spending time on a page, or getting their answer directly from a Smart Answer. If users leave satisfied, that’s a win. QSR blends multiple components to capture overall Rovo search success, including:

  • Clicks and Dwell Time: We track which search results get clicked and how long users stay on a page surfaced by search, helping us spot which answers are truly useful.
  • Explicit Feedback: Thumbs up/down on answers and search results.

We fuse these measurements to capture an overall success score across the search experience for both search results and smart answers.
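
A stylized version of that fusion (the thresholds and rules are invented for illustration, not the production definition of QSR) could mark a query as successful whenever a strong positive signal fires:

```python
def query_success(clicked: bool, dwell_seconds: float, thumbs_up: bool,
                  smart_answer_consumed: bool, thumbs_down: bool = False) -> bool:
    """Toy per-query success label; QSR is the fraction of successful queries."""
    if thumbs_down:
        return False
    return (
        (clicked and dwell_seconds >= 30)  # a click followed by meaningful dwell
        or thumbs_up                       # explicit positive feedback
        or smart_answer_consumed           # answered without needing a click
    )

sessions = [
    {"clicked": True, "dwell_seconds": 45, "thumbs_up": False, "smart_answer_consumed": False},
    {"clicked": False, "dwell_seconds": 0, "thumbs_up": False, "smart_answer_consumed": True},
    {"clicked": True, "dwell_seconds": 5, "thumbs_up": False, "smart_answer_consumed": False},
]
qsr = sum(query_success(**s) for s in sessions) / len(sessions)
print(f"QSR = {qsr:.2f}")  # 0.67 on this toy data
```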

Offline evaluation

We use query sets based on successful search clicks, human-curated golden query sets, and synthesized query sets to build a large set of query-document pairs, i.e., for a query Qi, which target document Di we expect to be returned. The main metrics are listed below, followed by a small sketch of how the first three can be computed.

  1. Recall@k measures the percentage of queries that return the expected document in the top k results. We measure multiple values of k based on our user experience and page sizes.
  2. NDCG (Normalized Discounted Cumulative Gain) is a metric used to evaluate the effectiveness of ranking algorithms, particularly in information retrieval and recommendation systems. It assesses how well a ranked list of items matches an ideal ranking, taking into account both the relevance of items and their positions in the list.
  3. MRR (Mean Reciprocal Rank) measures how well we rank the target document for the query. In contrast to recall, which is a binary yes or no, MRR also gives us a sense of whether the document is ranked well.
  4. LLM-as-a-Judge: We can also measure how well our search stack is doing using an LLM. We submit various queries from our query sets to our search stack and collect the results. This data is then fed to a Large Language Model (LLM), which evaluates whether the ranking was good. The LLM acts as an automated judge, simulating how a human would assess the ranking. We use this method to build our training datasets for our fine-tuned models.
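
The sketch below shows how the first three metrics can be computed for a set of (query, expected document) pairs against ranked result lists; it assumes a single relevant document per query, which makes the ideal DCG equal to 1 and NDCG equal to 1/log2(rank + 1).

```python
import math

def recall_at_k(ranked_ids: list, target_id: str, k: int) -> int:
    """1 if the expected document appears in the top k results, else 0."""
    return int(target_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids: list, target_id: str) -> float:
    """1/rank of the expected document, or 0 if it was not returned."""
    return 1.0 / (ranked_ids.index(target_id) + 1) if target_id in ranked_ids else 0.0

def ndcg_single_relevant(ranked_ids: list, target_id: str) -> float:
    """NDCG with exactly one relevant document: 1 / log2(rank + 1)."""
    if target_id not in ranked_ids:
        return 0.0
    rank = ranked_ids.index(target_id) + 1
    return 1.0 / math.log2(rank + 1)

# Each evaluation example: (ranked result ids, expected document id)
examples = [(["d3", "d1", "d7"], "d1"), (["d9", "d2"], "d5")]

print("Recall@3:", sum(recall_at_k(r, t, 3) for r, t in examples) / len(examples))
print("MRR:     ", sum(reciprocal_rank(r, t) for r, t in examples) / len(examples))
print("NDCG:    ", sum(ndcg_single_relevant(r, t) for r, t in examples) / len(examples))
```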

Why offline evaluation matters

Before we announced Rovo, we lacked online engagement for many products since we had no customers signed up yet, and Atlassian employees didn’t use all of the supported SaaS applications internally. Before we release new connectors, we want to understand the quality and performance of our stack. In many such cases, we rely on offline evaluation to understand how we are doing and how each improvement compares against our baseline.

Stay tuned for more

Rovo search continues to undergo significant upgrades as we add new connectors and pay close attention to customer needs. We have come a long way building a search relevance stack, an experimentation framework, and an evaluation pipeline, all of which are crucial to our success. Stay tuned for more seamless searching capabilities with Rovo. Start using Rovo now to make working easier: Rovo: Unlock organizational knowledge with GenAI | Atlassian
