The AI Technologies Behind Speech Editing at Loom

Speech editing at Loom lets users update parts of their video instantly, without re-recording. Powered by advanced AI, this technology enables seamless, high-quality edits—like swapping names or company details in your own voice—making video creation faster, more flexible, and highly personalized.

Speech Editing at Loom

Speech editing with voice clone at Loom allows users to instantly update parts of their video without re-recording, making video creation and editing seamless and efficient. This technology is pivotal for creating high-fidelity, evergreen video content (e.g., for training and enablement) at scale, significantly reducing the need for costly and time-consuming re-recordings.

Features like Audio Variables, which enable users to record once and personalize videos by swapping names or company details in their own voice, are powered by this underlying speech editing capability. The following examples will illustrate how the technology is used to modify speech, such as substituting names of people or companies, while maintaining a natural and coherent audio output.

Example 1

The Original

https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/example-1.mp3
Hey team, how’s it going? I wanted to broker a quick introduction. My name is Christo, and I lead the relationship between ScaleAI and Loom.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/amazon.mp3
Amazon – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Jeff, and I lead the relationship between Amazon and Loom.

https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/google.mp3
Google – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Sundar, and I lead the relationship between Google and Loom.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/meta.mp3
Meta – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Mark, and I lead the relationship between Meta and Loom.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/microsoft.mp3
Microsoft – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Satya, and I lead the relationship between Microsoft and Loom.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/apple.mp3
Apple – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Tim, and I lead the relationship between Apple and Loom.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/atlassian.mp3
Atlassian – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Mike, and I lead the relationship between Atlassian and Loom.

Example 2

The Original

https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/example-2.mp3
Hey, Mark, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Notion integration.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/confluence.mp3
Confluence – Hey, Ashwin, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Confluence integration.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/jira.mp3
Jira – Hey, Luis, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Jira integration.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/trello.mp3
Trello – Hey, Sean, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Trello integration.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/bitbucket.mp3
Bitbucket – Hey, Elizabeth, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Bitbucket integration.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/opsgenie.mp3
Opsgenie – Hey, Laura, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Opsgenie integration.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/sourcetree.mp3
SourceTree – Hey, Emily, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom SourceTree integration.

In January 2025, we made the audio variables feature generally available on Business+ AI and Enterprise plans. With audio variables, users can record once and personalize videos (e.g., swap in different names or company names) in their own voice, enabling hyper-personalized outreach and communication without repetitive manual work. See: https://support.loom.com/hc/en-us/articles/14974723544733-How-to-personalize-videos-at-scale-Beta.

This document dives into the details of the AI model architecture behind speech editing at Loom.

The Unique Challenges of Speech Editing

While Text-to-Speech (TTS) technology is well-established, speech editing presents distinct and more complex challenges. Traditional TTS typically generates entire speech segments from scratch, often in a free-form manner, without the tight constraint of integrating into pre-existing audio. In contrast, speech editing involves modifying specific portions of existing audio, and it is desirable to limit the “change surface” to the absolute minimum necessary. This surgical approach helps ensure that any potential artifacts or unnaturalness in the generated audio are less noticeable, because the surrounding, original audio remains untouched and provides a strong, coherent anchor.

The speech editing process requires three key inputs:

  1. Original Audio Waveform: The raw audio recording that needs to be edited.
  2. Original Transcript: A textual representation of the original audio.
  3. Modified Transcript: The desired new version of the transcript, reflecting the changes to be made to the audio.

The core challenge, therefore, is to implement these precise edits while maintaining coherence at the boundaries of the edited segments and preserving the original speaker’s tempo, tone, and overall vocal style. The aim is for the modified speech to sound as if the person spoke it naturally as part of the original recording.

Simply generating an audio clip of the new speech using traditional TTS methods and inserting it would likely sound out of place due to a lack of coherence with the surrounding acoustic and prosodic context.
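
Concretely, a speech edit request can be thought of as a bundle of these three inputs. Here is a minimal sketch in Python; the field names are illustrative, not Loom’s actual API:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SpeechEditRequest:
    """Hypothetical container for the three inputs described above."""
    waveform: np.ndarray        # original audio samples, mono
    sample_rate: int            # sampling rate of `waveform`, e.g. 24000
    original_transcript: str    # what was actually said
    modified_transcript: str    # what the edited audio should say
```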

Zero-Shot Learning: A Universal Model

A primary goal was to create a universal model capable of generalizing to new input data (voices and acoustic environments) not encountered during its training phase—a concept known as zero-shot learning.

Many commercial voice cloning solutions use an overfitting or fine-tuning approach, where a model is trained or explicitly fine-tuned for each user’s voice using a sample of their speech (ranging from tens of seconds to minutes). While this can deliver high-quality voice cloning, it has several drawbacks:

  1. Each user’s voice requires its own training or fine-tuning run, which adds per-user cost and a delay before the feature becomes usable.
  2. The resulting per-user models must be stored, managed, and loaded on demand, which complicates operations and adds inference delays.
  3. Costs scale with the number of users rather than with a single shared model.

In contrast, zero-shot learning, while challenging and requiring powerful models trained on massive datasets, circumvents these issues. A single, robust model can serve all users without needing their specific training data. This simplifies model training and management, reduces inference delays (as the same model can remain loaded in memory), and significantly lowers per-user costs, making advanced features more accessible.

Masked Acoustic Modeling (MAM)

The core of our speech editing system is the acoustic model, which is based on an architecture similar to Voicebox from Meta AI Research. This model utilizes a Masked Acoustic Modeling (MAM) training technique. This approach, analogous to the Masked Language Modeling (MLM) used in models like BERT, is particularly well-suited to the demands of speech editing.

MAM defines the problem as an “infilling” task. During training, sections of the input audio are deliberately removed (masked), and the model is tasked with predicting or reconstructing this missing audio content based on the surrounding, unmasked audio. This directly mirrors the core requirement of speech editing: to generate new audio for a specific segment that is highly coherent with the existing, unaltered portions of the original recording. By training on a vast amount of data in this manner, the model learns not only to generate speech but also to generate speech that seamlessly integrates with the acoustic properties (such as timbre, pitch, and background noise) and prosodic features (like tempo and intonation) of the provided context. This makes MAM an effective problem definition because it inherently teaches the model to maintain the crucial coherence and naturalness needed when modifying only parts of an audio track, rather than generating it entirely in isolation.
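
As a rough illustration of the MAM setup (not Loom’s training code), the sketch below masks a random contiguous span of spectrogram frames; during training, the model would be asked to reconstruct the masked frames from the surrounding context and the phoneme sequence:

```python
import numpy as np


def mask_random_span(mel: np.ndarray, min_frac: float = 0.1,
                     max_frac: float = 0.5, rng=None):
    """Mask a contiguous span of spectrogram frames for infilling training.

    mel: (n_mels, n_frames) spectrogram. Returns the masked spectrogram,
    a boolean frame mask, and the original frames the model must reconstruct.
    """
    rng = rng or np.random.default_rng()
    n_frames = mel.shape[1]
    span = int(n_frames * rng.uniform(min_frac, max_frac))
    start = rng.integers(0, n_frames - span + 1)

    mask = np.zeros(n_frames, dtype=bool)
    mask[start:start + span] = True

    masked_mel = mel.copy()
    masked_mel[:, mask] = 0.0          # masked frames are zeroed here; a real system may use a learned mask value
    target = mel[:, mask]              # what the model should infill
    return masked_mel, mask, target
```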

Speech Editing: End-to-End Workflow

The speech editing workflow, designed to support the acoustic model, involves several interconnected stages and auxiliary components that ensure it runs reliably in a production environment:

1. Waveform to MelSpectrogram Conversion

The workflow begins by processing the Original Audio Waveform. The encoder component of a vocoder transforms the raw, one-dimensional audio waveform into an intermediate, more compact representation: a MelSpectrogram. This signal processing step converts the speech from the time domain to the frequency domain, representing it as a two-dimensional “image” where one axis represents time, the other represents frequency, and the intensity corresponds to amplitude. This conversion reframes the audio editing task as an image editing problem, specifically “image infilling” (inpainting), analogous to techniques used in image models like Stable Diffusion.
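
For illustration, a MelSpectrogram can be computed with an off-the-shelf library such as librosa; the parameters below are typical values, not necessarily the ones used in production:

```python
import librosa
import numpy as np


def waveform_to_mel(path: str, sr: int = 24000, n_mels: int = 80,
                    hop_length: int = 240) -> np.ndarray:
    """Load audio and convert it to a log-mel spectrogram.

    With sr=24000 and hop_length=240, each spectrogram column ("frame")
    covers 10 ms of audio, matching the frame granularity mentioned later.
    """
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(mel + 1e-6)   # log compression; shape: (n_mels, n_frames)
```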

2. Transcripts Processing

Both the Original Transcript and the Modified Transcript must be prepared for the speech models. This involves two sub-steps:

a. Text Normalization

Transcripts are often in written form rather than spoken form, which can complicate downstream tasks like phonemization and TTS. We normalize input transcripts into spoken form to improve TTS accuracy; typical examples include expanding numbers, dates, abbreviations, and symbols into the words a speaker would actually say.

Written forms can be ambiguous (e.g., “St.” could be “Street” or “Saint”; “2022” could be “two thousand and twenty-two” or “twenty twenty-two”). We utilize the NeMo-text-processing library, which enables context-aware processing to minimize such ambiguities. While perfection is unattainable, the level of ambiguity is generally manageable.
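
A minimal usage sketch of NeMo-text-processing is shown below; the exact configuration used in production is not shown here, and the printed output is only illustrative:

```python
# Requires: pip install nemo_text_processing
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(input_case="cased", lang="en")

written = "Meet me at 221B Baker St. in 2022."
spoken = normalizer.normalize(written, verbose=False)
print(spoken)  # e.g. "Meet me at two twenty one B Baker Street in twenty twenty two."
```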

b. Phonemization

After text normalization, the spoken-form English sentences are converted into the International Phonetic Alphabet (IPA). This helps the TTS system infer the pronunciation of words (e.g., “Atlassian” → ætlˈæsiən, “Confluence” → kˈɑːnfluːəns).

We utilize a phonemizer that uses Espeak as its backend. Espeak features a robust rule-based phonemizer that, anecdotally, surpasses many table-based and ML-based phonemizers.
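
The open-source phonemizer package exposes an eSpeak backend of this kind; whether it is the exact library used in production is not stated, so treat this as an illustrative sketch:

```python
# Requires: pip install phonemizer (and the espeak-ng system package)
from phonemizer import phonemize

ipa = phonemize(
    ["Atlassian", "Confluence"],
    language="en-us",
    backend="espeak",
    strip=True,
)
print(ipa)  # IPA strings along the lines of ['ætlˈæsiən', 'kˈɑːnfluːəns']
```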

However, several challenges remain with this approach.

Upon completion of these sub-steps, we have two sequences of phonemes: one for the original audio and one for the desired modified audio.

3. Forced Alignment

This stage aligns the acoustic features of the original audio (represented by the MelSpectrogram from Step 1) with its textual representation (the phoneme sequence from Step 2b). A Forced Aligner determines when each phoneme in the original transcript occurs in the audio. Without an accurate alignment, subsequent editing and masking processes will not target the correct audio segments and therefore will not work effectively.

The Forced Aligner is an ML model that maps each phoneme to its corresponding segment (time frames or “columns”) in the MelSpectrogram, establishing precise start and end times. The primary output is a set of durations for each phoneme.

For instance, it might determine that the phoneme /iː/ in the word “thirteen” corresponds to 26 columns (time frames) in the spectrogram, where each column typically represents a small unit of time (e.g., approximately 10 milliseconds). This phoneme-level timing information is vital for knowing which audio segments to edit.

Visualization of an aligned phoneme sequence
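
As a small illustration of how these durations are used downstream, the sketch below converts per-phoneme frame counts into time spans, assuming a 10 ms frame hop (the phonemes and durations are made up for the example):

```python
import numpy as np

FRAME_MS = 10  # assumed frame hop of ~10 ms per spectrogram column


def durations_to_boundaries(phonemes, durations_in_frames):
    """Turn per-phoneme frame counts into (phoneme, start_ms, end_ms) spans."""
    ends = np.cumsum(durations_in_frames)
    starts = ends - np.asarray(durations_in_frames)
    return [
        (p, int(s) * FRAME_MS, int(e) * FRAME_MS)
        for p, s, e in zip(phonemes, starts, ends)
    ]


# e.g. the /iː/ in "thirteen" spanning 26 frames ≈ 260 ms
print(durations_to_boundaries(["θ", "ɜː", "t", "iː", "n"], [8, 12, 6, 26, 10]))
```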

The Forced Aligner

Traditional forced aligners (e.g., HTK, Montreal Forced Aligner) are often HMM-based and suited to offline batch processing. While these aligners perform satisfactorily and have been widely adopted by researchers, they are not ideal for real-time production use.

Our initial attempt to build a forced aligner on top of Wav2Vec pre-trained models also yielded unsatisfactory performance and accuracy.

Ultimately, we developed and trained our own forced aligner based on the “One TTS Alignment to Rule Them All” paper from NVIDIA, with a few key modifications to the original model architecture:

  1. The original paper’s static “prior” term is intended to favor diagonal alignment and speed up convergence. In practice, it proved problematic: for speech with long silences it could actually slow convergence, so we removed this term.
  2. We replaced the convolutional network component of the aligner with a transformer architecture, inspired by the Vision Transformer (ViT), resulting in faster convergence and improved accuracy.

This alignment information is set aside for constructing the “masked spectrogram condition” in Step 6.

4. Sequence Matching

To precisely identify the edits needed, the phoneme sequence from the original transcript is compared against the sequence from the modified transcript (both processed in Step 2) using a sequence matcher.

This process is analogous to a “diff” operation in code version control. It identifies the differences between the two phoneme sequences, categorizing each change as an insertion, a deletion, or a replacement, while unchanged spans are left untouched.
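
A minimal illustration using Python’s difflib as a stand-in for the production sequence matcher (the phoneme sequences below are invented for the example):

```python
from difflib import SequenceMatcher

# Rough IPA for "Hey, Mark" and "Hey, Ashwin" (illustrative only)
original = ["h", "eɪ", "m", "ɑː", "ɹ", "k"]
modified = ["h", "eɪ", "æ", "ʃ", "w", "ɪ", "n"]

matcher = SequenceMatcher(a=original, b=modified, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    # tag is one of "equal", "replace", "delete", "insert"
    print(tag, original[i1:i2], "->", modified[j1:j2])
```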

The output, detailing the type and location of each edit, serves as a blueprint for the subsequent construction of a masked MelSpectrogram. However, there’s one more piece of information needed before we can proceed.

5. Duration Prediction

When words or phrases are inserted or replaced, the spoken duration of the new phonemes is unknown. A Duration Predictor model estimates these durations. For example, “fourteen” and “fifteen” have different phoneme durations. The model takes the phoneme sequence (with known durations for unchanged parts, as determined by forced alignment in Step 3, and masked or unknown durations for new or changed parts) and predicts missing durations, considering the context of surrounding phonemes.

The Voicebox paper proposed two implementations: one using Continuous Normalizing Flows (CNF) and a simpler regression model akin to FastSpeech2. We opted for the FastSpeech2-style regression-based duration predictor for its simplicity and fast inference.
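
A rough sketch of such a regression-based duration predictor in PyTorch is shown below; the layer sizes and structure are illustrative, not Loom’s exact architecture:

```python
import torch
from torch import nn


class DurationPredictor(nn.Module):
    """FastSpeech2-style regression duration predictor (sketch).

    Takes per-phoneme embeddings and predicts log-durations in frames.
    Durations known from forced alignment can be kept, with only the
    masked positions taken from the prediction.
    """

    def __init__(self, dim: int = 256, kernel_size: int = 3, dropout: float = 0.1):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
                nn.ReLU(),
            )
            for _ in range(2)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(dim, 1)

    def forward(self, phoneme_emb: torch.Tensor) -> torch.Tensor:
        # phoneme_emb: (batch, n_phonemes, dim)
        x = phoneme_emb
        for conv, norm in zip(self.convs, self.norms):
            x = conv(x.transpose(1, 2)).transpose(1, 2)  # convolve along the phoneme axis
            x = self.dropout(norm(x))
        return self.proj(x).squeeze(-1)  # (batch, n_phonemes) log-durations
```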

6. Constructing the Masked MelSpectrogram

With alignment, sequence matching, and duration prediction complete, all the necessary information is available to construct the “masked MelSpectrogram,” which will serve as the input to the acoustic model.

The resulting masked MelSpectrogram is a composite of the preserved original audio, adjusted for deletions, and masked areas for new or replaced content. This, along with the target phonetic information for the masked regions, is then passed to the acoustic model.
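
The sketch below shows one way such a composite could be assembled from the diff opcodes and predicted durations; it zeroes out masked regions for simplicity, whereas a real system might use a learned mask embedding:

```python
import numpy as np


def build_masked_mel(mel, opcodes, orig_frames_per_phoneme, new_frames_per_phoneme):
    """Assemble the masked spectrogram from diff opcodes (sketch).

    - "equal" spans copy the original frames unchanged,
    - "delete" spans are simply dropped,
    - "insert"/"replace" spans become masked (zeroed) regions whose width
      comes from the duration predictor.
    Returns the composite spectrogram and a boolean mask of frames to generate.
    """
    n_mels = mel.shape[0]
    orig_starts = np.concatenate([[0], np.cumsum(orig_frames_per_phoneme)])
    columns, mask = [], []

    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            frames = mel[:, orig_starts[i1]:orig_starts[i2]]
            columns.append(frames)
            mask.extend([False] * frames.shape[1])
        elif tag in ("insert", "replace"):
            width = int(sum(new_frames_per_phoneme[j1:j2]))
            columns.append(np.zeros((n_mels, width)))
            mask.extend([True] * width)
        # "delete" spans contribute nothing

    return np.concatenate(columns, axis=1), np.asarray(mask, dtype=bool)
```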

7. Infilling the Masked Spectrogram

This stage is where the new audio content is generated. The masked spectrogram and phoneme information are fed into the acoustic model.

The acoustic model is a Continuous Normalizing Flow (CNF) trained with flow matching. Flow matching is a simulation-free approach that regresses the vector fields of conditional probability paths, enabling efficient CNF training by learning to transform a simple noise distribution into the complex data distribution of speech.

The process starts with random noise in the masked sections. Flow matching trains the model to predict vector fields at each timestep, conditioned on aligned phonemes (target sounds) and surrounding speech context. An Ordinary Differential Equation (ODE) solver then integrates these vector fields to generate the final audio. This method typically requires fewer inference steps than traditional diffusion models.
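
A minimal fixed-step Euler integration loop is sketched below; `model` is a placeholder for the trained acoustic model, and its exact conditioning interface is assumed rather than taken from the production system:

```python
import torch


@torch.no_grad()
def infill_with_euler(model, masked_mel, mask, phoneme_cond, n_steps: int = 8):
    """Integrate the learned vector field with a fixed-step Euler solver (sketch).

    masked_mel: (batch, n_mels, n_frames) spectrogram with masked regions zeroed.
    mask: boolean tensor broadcastable to masked_mel, True on frames to generate.
    model(x_t, t, masked_mel, phoneme_cond) is assumed to return the vector field v_t.
    Only the masked frames are updated, so the original audio context stays untouched.
    """
    x = torch.randn_like(masked_mel)          # start from noise
    x = torch.where(mask, x, masked_mel)      # keep original frames fixed
    dt = 1.0 / n_steps

    for step in range(n_steps):
        t = torch.full((x.shape[0],), step * dt, device=x.device)
        v = model(x, t, masked_mel, phoneme_cond)
        x = x + dt * v                        # Euler step along the flow
        x = torch.where(mask, x, masked_mel)  # re-clamp the unmasked context

    return x
```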

The acoustic model “infills” these masked sections, generating new audio matching the target phonemes while maintaining consistency in vocal style, tempo, and speaker characteristics with the surrounding original audio. The output is a complete, new MelSpectrogram with the edits seamlessly incorporated.

Visualization of the MelSpectrogram evolving over eight timesteps, from random noise to speech

8. Waveform Reconstruction (Vocoder Decode)

The final step converts the generated MelSpectrogram back into a one-dimensional audio waveform using a vocoder decoder. This reverses the initial encoding process (Step 1), typically employing a neural network to reconstruct information (such as phase) that the MelSpectrogram discards, thereby accurately recovering the waveform.

We’ve experimented with different vocoders. Initially, ParallelWaveGAN was used, and later we transitioned to Vocos and BigVGAN for improved performance. Practically, we found the choice of vocoder had little impact on the final output quality.
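
For reference, the publicly released Vocos checkpoint can be used as follows; Loom’s internal vocoder configuration is not specified here, so this is only an illustrative sketch:

```python
# Requires: pip install vocos
import torch
from vocos import Vocos

# Pretrained 24 kHz mel-to-waveform checkpoint published by the Vocos authors.
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)   # (batch, n_mels, n_frames); stand-in for the edited spectrogram
waveform = vocos.decode(mel)     # (batch, n_samples) waveform at 24 kHz
```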

Safeguards

Speech editing systems can pose a significant ethical risk if misused. The core risk is that someone could use this technology to impersonate another person’s voice without their knowledge or consent. To mitigate some of the risks, we implemented various product-level and data-level safeguards.

Product-Level Restrictions

Data Use and Retention

Usage Tracking

Afterword

The development of Loom’s speech editing technology reflects a commitment to pushing the boundaries of video communication, making it more fluid and less constrained by the finality of a single recording session. By architecting a system that can intelligently and coherently modify spoken audio, we aim to empower users, giving them greater flexibility and control over their content long after the initial recording.

As AI continues to evolve, the principles guiding this work—prioritizing user experience, ensuring scalability, and building for seamless integration—will remain central to its development. The journey of refining and expanding these capabilities is ongoing, with the ultimate goal of making video an even more powerful and adaptable medium for connection and expression.

https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/taken_edited.mp4
How Liam uses Loom to replace documentation and avoid unnecessary meetings.
https://atlassian.reaktivdev.com/wp-content/uploads/2025/07/loom-engineering-is-fun.mov
It has been very fun!