Comment ranker – An ML-based classifier to improve LLM code review quality using Atlassian’s proprietary data
Atlassian’s ML-based comment ranker filters LLM code review comments, boosting quality and efficiency with proprietary data.
The code reviewer agent was developed by the DevAI org as one of the major LLM-based Rovo Dev agents to improve developer productivity. Its open beta program launched at Team25 in Anaheim, and it already has 10K+ monthly active users and helps review 43K+ PRs per month for beta customers. An internal study by our team showed a 30% reduction in PR cycle time for pull requests resolved by the code reviewer agent compared to those resolved by humans. The agent has received a great deal of positive feedback from both internal and external customers, and improving its quality over a year of development has been quite a journey.
Without any filtering in place, LLM-generated comments can be noisy, nit-picky, or even factually wrong, which directly leads to negative user feedback. This was confirmed both in the early stages of our internal dogfooding and by studies across the industry.
Historically, we implemented multiple filters to select better comments based on heuristics or simple data analysis. One example of such a heuristic is categorizing a generated comment using an LLM and only keeping comments in certain categories. These heuristic-based filters proved very useful in improving precision, albeit at the cost of recall, but they are not sophisticated enough to fully leverage our data, ML, and state-of-the-art (SOTA) model architectures like transformers to improve overall quality.
With this in mind, we developed an ML model to rank and select good comments, which we named the “comment ranker” (or “ML comment ranker”). The goals of the ML comment ranker are:
- Improve the quality of code reviewer comments posted on the PR (e.g. improve offline and online metrics).
- Improve maintainability and responsiveness in addressing newer patterns of problems introduced by foundation model changes or product rollouts.
- Consolidate the code reviewer comment selection process into something more holistic and scientific (e.g. replacing multiple heuristic-based filters).
- Reduce the cost, latency, and output stochasticity due to the use of LLMs.
How the Comment Ranker Works

The code reviewer agent is triggered when a PR is created. It first calls an LLM to generate the initial code review comments based on the code diff, the PR title/description, and the linked Jira work item and its description. Those generated comments then become the input for the comment ranker, whose task is to select the useful ones. The selection process is optimized for our north star online metric, code resolution rate (CRR), which is similar to the precision metric we use offline. We want the comment ranker to predict how likely a given code review comment is to lead to code resolution.
“Code resolution” is a binary outcome. It is defined by whether the PR author made code changes, after the comment was created, on the line of code where the comment was made. A positive signal is a comment that triggers such code changes (the comment will then show an “outdated” status in Bitbucket Cloud), which strongly suggests the comment delivered value to the author.
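As an illustration, the sketch below turns that outcome into a training label. The CommentRecord fields here are hypothetical stand-ins for the metadata we collect, not Bitbucket Cloud's actual data model.

```python
# Hypothetical sketch of deriving the binary "code resolution" label.
# The CommentRecord fields are illustrative, not Bitbucket Cloud's real schema.
from dataclasses import dataclass


@dataclass
class CommentRecord:
    text: str          # the LLM-generated review comment
    is_outdated: bool  # True if the commented line changed after the comment was posted


def code_resolution_label(comment: CommentRecord) -> int:
    # Positive label: the author changed the commented line of code after the
    # comment was created (shown as an "outdated" comment in Bitbucket Cloud).
    return 1 if comment.is_outdated else 0
```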
In machine learning, ground truth refers to the correct, real, or authoritative labels or outcomes used to train or evaluate a model. It matters because the model learns by comparing its predictions to the ground truth. We used the binary code resolution outcome from internally-sourced dogfooding data as the ground truth to train our model. It provides a large volume of actual user action signals, with more than 53K code reviewer comments captured along with their associated code resolution outcomes. Its scope expanded significantly when we rolled the agent out to all eligible Atlassian internal repos.
The ML problem can be framed as a classic binary classification task: we predict the code resolution outcome, and the model outputs a propensity score between 0 and 1. Any comment we want to post then has to pass a threshold score, which is chosen and optimized through A/B testing.
We currently use only the LLM-generated comment content in its raw textual form. This has performed well so far, and we are exploring additional signals to enhance the model, as discussed in the “Future Work” section.
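To make the scoring step concrete, here is a minimal sketch of how the propensity score and threshold filter could look at serving time, assuming a HuggingFace-style classifier fine-tuned as described later in this post. The checkpoint path, class index, and 0.5 threshold are illustrative placeholders, not our production configuration.

```python
# Minimal scoring-and-filtering sketch; checkpoint path and threshold are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "comment-ranker"  # hypothetical path to a fine-tuned classifier
THRESHOLD = 0.5                # in practice, chosen and optimized through A/B testing

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()


def propensity_score(comment: str) -> float:
    # Tokenize the raw comment text and run it through the classifier.
    inputs = tokenizer(comment, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the two classes; index 1 is treated as the "resolved" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()


def should_post(comment: str) -> bool:
    # Only comments whose score clears the threshold get posted to the PR.
    return propensity_score(comment) >= THRESHOLD
```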
Given this task, the input format, and the recent success of transformer-based natural language processing (NLP) models, we selected the open-source ModernBERT model, a recent high-performing member of the BERT model family. BERT has been one of the most widely used open-source models and has proved successful across a variety of language-understanding classification tasks. Its underlying transformer architecture is the core deep learning architecture that powers how most modern LLMs understand and generate language.
An out-of-the-box BERT model is not trained on data or tasks similar to ours, so fine-tuning on our own data is necessary to ensure that the propensity scores it generates for code resolution are useful and accurate. Below is a high-level summary of the fine-tuning steps (a minimal code sketch follows the list):
- Download a pre-trained ModernBERT model (e.g. from HuggingFace, a GitHub-like platform for hosting and sharing ML models).
- Specify text input (code review comment) and ground truth label (code resolution binary outcome).
- Tokenize the text (convert the text into a sequence of numbers using a tokenizer, which can also be downloaded from HuggingFace).
- Customize the model for our classification task by defining a classification neural network layer on top.
- Train the model. The model generates a prediction, compares it to the ground truth, adjusts its weights to reduce the error (via back-propagation), and repeats this over several passes (epochs).
- Evaluate the model. After training, we evaluate it on a “hold-out” dataset to see how well it generalizes to unseen data.
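Assuming the HuggingFace transformers and datasets libraries, a minimal version of these steps could look like the sketch below. The dataset, hyperparameters, and output paths are toy placeholders, and evaluation on the hold-out set is omitted for brevity.

```python
# Minimal fine-tuning sketch with HuggingFace transformers/datasets.
# Data, hyperparameters, and paths are illustrative, not production settings.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "answerdotai/ModernBERT-base"  # pre-trained ModernBERT checkpoint

# Toy training data: comment text plus the binary code resolution label.
train_data = Dataset.from_dict({
    "text": [
        "This loop can raise an IndexError when the list is empty.",
        "Nit: consider renaming this variable.",
    ],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def tokenize(batch):
    # Convert raw comment text into token IDs the model understands.
    return tokenizer(batch["text"], truncation=True, max_length=512)


train_data = train_data.map(tokenize, batched=True)

# num_labels=2 places a fresh binary classification head on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

args = TrainingArguments(
    output_dir="comment-ranker",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

# The Trainer handles batching, back-propagation, and looping over epochs.
trainer = Trainer(model=model, args=args, train_dataset=train_data, tokenizer=tokenizer)
trainer.train()
trainer.save_model("comment-ranker")  # fine-tuned checkpoint used for scoring
```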
Model fine-tuning can be time- and resource-intensive, since the model has to make multiple passes over the entire dataset while tracking and updating hundreds of millions of weight parameters. GPUs are usually needed for these fine-tuning tasks: they are optimized for parallel floating-point operations, which makes them dramatically faster than CPUs for deep learning workloads. Thanks to GPU support from our ML platform team, fine-tuning these models finishes within a few hours.
Model Refresh
Training the comment ranker model is not a one-off effort; it needs regular refreshes/retraining for the following reasons:
- It needs to keep up with our ongoing adoption of newer upstream SOTA LLMs.
- It needs to leverage ever-growing usage data to stay robust against increasing traffic and newer patterns in the code over time.
This is a typical “data drift” or “model degradation over time” problem. A good example: when we extended product enablement from a limited number of repos to all Atlassian repos, model precision dropped (CRR fell from ~40% to ~33%), mainly because the first version of the model was trained on a limited set of data signals from those original repos. With more diverse data signals collected from all Atlassian repos after that enablement (review comments in the training data grew from ~10K to ~53K), we retrained the model and precision quickly recovered (CRR climbed back to ~40%).
Impact and User Feedback
Since the comment ranker shipped, we have seen its crucial impact in improving key online metrics and serving as a reliable, robust quality gate.
- Both versions of the comment ranker have contributed to increasing CRR to between 40% and 45% while maintaining a similar coverage/prevalence level. The increased CRR is very close to the human benchmark of about 45%, which was determined internally from our engineers’ Bitbucket usage data.
- The comment ranker has shown robustness in handling new user bases (and code bases). The code reviewer currently has 400+ active external beta customers, and their metrics are even better than those from internal users.
- The comment ranker has also shown strong adaptability to newer models. When we switched our comment generation model from GPT-4o to Claude Sonnet 3.5, we were initially concerned that CRR might drop, given the comment ranker had only been trained on comments generated by GPT-4o. After we made the switch with the comment ranker in place, CRR stayed very consistent before and after. More surprisingly, this before-and-after consistency also holds when we break the metrics down by category (e.g. code bugs, code readability, code design). As the diagram below further illustrates, this suggests the comment ranker is largely agnostic to the underlying generation model when selecting good comments.

Beyond the impressive quantitative results, we have also collected a large amount of positive qualitative feedback from both internal and external users, sourced from feedback collectors as well as customer interviews conducted by our PMs. One external user recently said our code reviewer is better than competitors like CodeRabbit used in GitHub because it is more “ambient, controlled and careful” as opposed to “exposing a lot more information and comments that overloads the team”. Users also consistently praise how the code reviewer “does a good job as an initial reviewer” and “catches parts that are easy for humans to miss”. What stands out most is that “it not only improves code quality but also transforms team dynamics by removing human emotion from reviews, reducing interpersonal friction and making feedback feel more neutral and objective.”
Summary
The comment ranker is the first ML model developed by DevAI to improve code review quality. It has quickly become a crucial part of the product, mainly because of its unique use of actual user signals (whether a user made subsequent code changes after the code reviewer’s comments). Whether a user (our internal users specifically) made code changes to address a comment or chose to ignore it, those actions become valuable signals (ground truth labels) in our training data, guiding the model to select higher-quality comments to post. Every user has contributed to the success of our code reviewer product!
Thanks to our large Atlassian developer base, which generated a large amount of internal dogfooding data with rich signals, we were able to build this robust proprietary ML model, which serves as a moat differentiating us from LLM vendors and other code review competitors. Rather than being replaced by LLMs, as many existing processes and workflows are, the comment ranker complements LLMs well. With it in place as a quality gate, we smoothly transitioned from GPT-4o to Sonnet 3.5, and then from Sonnet 3.5 to Sonnet 4, with improved quality and user experience. We are confident it will keep delivering value and complementing newer LLMs to come.
Future Work
Beyond refreshing/retraining the comment ranker on a regular cadence, we are also working on feature engineering to generate and leverage more data signals. Examples include raw or derived signals from the code diff, comment category, file extension, PR title, PR description, and Jira title and description.
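As a purely illustrative example of how such signals could be folded in without changing the model architecture, one option is to serialize them into the text input alongside the comment so the same classifier can attend to all of them. The tags and field names below are hypothetical, not a committed design.

```python
# Hypothetical input serialization for additional signals; tags are illustrative.
def build_ranker_input(comment: str, category: str, file_ext: str, pr_title: str) -> str:
    # Prepend structured signals as tagged text so a text classifier like
    # ModernBERT can attend to them jointly with the comment itself.
    return (
        f"[CATEGORY] {category} "
        f"[FILE_EXT] {file_ext} "
        f"[PR_TITLE] {pr_title} "
        f"[COMMENT] {comment}"
    )


print(build_ranker_input(
    comment="This loop never terminates when items is empty.",
    category="code bug",
    file_ext=".py",
    pr_title="Refactor pagination helper",
))
```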
By leveraging a wider variety of data signals and our steadily growing dataset, we expect to deliver significant performance improvements to this model, resulting in more useful, time-saving comments for our customers.