Speech Editing at Loom

Speech editing with voice clone at Loom allows users to instantly update parts of their video without re-recording, making video creation and editing seamless and efficient. This technology is pivotal for creating high-fidelity, evergreen video content (e.g., for training and enablement) at scale, significantly reducing the need for costly and time-consuming re-recordings.

Features like Audio Variables, which enable users to record once and personalize videos by swapping names or company details in their own voice, are powered by this underlying speech editing capability. The following examples will illustrate how the technology is used to modify speech, such as substituting names of people or companies, while maintaining a natural and coherent audio output.

Example 1

The Original

Hey team, how’s it going? I wanted to broker a quick introduction. My name is Christo, and I lead the relationship between ScaleAI and Loom.
Amazon – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Jeff, and I lead the relationship between Amazon and Loom.

Google – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Sundar, and I lead the relationship between Google and Loom.
Meta – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Mark, and I lead the relationship between Meta and Loom.
Microsoft – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Satya, and I lead the relationship between Microsoft and Loom.
Apple – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Tim, and I lead the relationship between Apple and Loom.
Atlassian – Hey team, how’s it going? I wanted to broker a quick introduction. My name is Mike, and I lead the relationship between Atlassian and Loom.

Example 2

The Original

Hey, Mark, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Notion integration.
Confluence – Hey, Ashwin, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Confluence integration.
Jira – Hey, Luis, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Jira integration.
Trello – Hey, Sean, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Trello integration.
Bitbucket – Hey, Elizabeth, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Bitbucket integration.
Opsgenie – Hey, Laura, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom Opsgenie integration.
SourceTree – Hey, Emily, it’s Sasha here. I’m excited to connect with you. I wanted to speak to the Loom SourceTree integration.

In January 2025, we launched the Audio Variables feature (available on Business+ AI and Enterprise plans) to general availability. With Audio Variables, users can record once and personalize videos (e.g., swap in different names or company names) in their own voice, enabling hyper-personalized outreach and communication without repetitive manual work. See https://support.loom.com/hc/en-us/articles/14974723544733-How-to-personalize-videos-at-scale-Beta for details.

This document dives into the details of the AI model architecture behind speech editing at Loom.

The Unique Challenges of Speech Editing

While Text-to-Speech (TTS) technology is well-established, speech editing presents distinct and more complex challenges. Traditional TTS typically generates entire speech segments from scratch, often in a free-form manner without the tight constraints of integrating into pre-existing audio. In contrast, speech editing involves modifying specific portions of existing audio. It is desirable to limit the “change surface” to the absolute minimum necessary. A surgical approach helps ensure that any potential artifacts or unnaturalness in the generated audio are less noticeable because the surrounding, original audio remains untouched, providing a strong, coherent anchor.

The speech editing process requires three key inputs:

  1. Original Audio Waveform: The raw audio recording that needs to be edited.
  2. Original Transcript: A textual representation of the original audio.
  3. Modified Transcript: The desired new version of the transcript, reflecting the changes to be made to the audio.
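As a minimal sketch, these three inputs can be grouped into a single request object (the names and types below are illustrative only, not Loom's actual API):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SpeechEditRequest:
    """Illustrative container for the three speech-editing inputs."""
    waveform: np.ndarray         # original audio samples (mono, float32)
    sample_rate: int             # e.g., 24000
    original_transcript: str     # what was actually said
    modified_transcript: str     # what the edited audio should say
```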

The core challenge, therefore, is to implement these precise edits while maintaining coherence at the boundaries of the edited segments and preserving the original speaker’s tempo, tone, and overall vocal style. The aim is for the modified speech to sound as if the person spoke it naturally as part of the original recording.

Simply generating an audio clip of the new speech using traditional TTS methods and inserting it would likely sound out of place due to a lack of coherence with the surrounding acoustic and prosodic context.

Zero-Shot Learning: A Universal Model

A primary goal was to create a universal model capable of generalizing to new input data (voices and acoustic environments) not encountered during its training phase—a concept known as zero-shot learning.

Many commercial voice cloning solutions use an overfitting or fine-tuning approach, where a model is trained or explicitly fine-tuned for each user’s voice using a sample of their speech (ranging from tens of seconds to minutes). While this can deliver high-quality voice cloning, it has several drawbacks:

  • User Experience: New users cannot immediately use the feature; they must provide a speech sample and wait for a custom model to be trained, a process that can take anywhere from minutes to hours.
  • Data Security and Trust: This necessitates additional processes and systems to safeguard user speech samples and/or speaker embeddings, which adds to operational complexity and potentially raises questions about user trust.
  • Engineering Complexity: It involves coordinating data collection, model training, data lineage, and data retention for individual users.
  • Inference Delays: A different fine-tuned model checkpoint must be retrieved from storage for each user, since checkpoints cannot be shared among users.
  • Cost: Training and storing individual models incur significant expenses. This necessitates charging a premium for the feature, making it less accessible to the general audience.

In contrast, zero-shot learning, while challenging and requiring powerful models trained on massive datasets, circumvents these issues. A single, robust model can serve all users without needing their specific training data. This simplifies model training and management, reduces inference delays (as the same model can remain loaded in memory), and significantly lowers per-user costs, making advanced features more accessible.

Masked Acoustic Modeling (MAM)

The core of our speech editing system is the acoustic model, which is based on an architecture similar to Voicebox from Meta AI Research. This model utilizes a Masked Acoustic Modeling (MAM) training technique. This approach, analogous to Masked Language Modeling (MLM) used in models like BERT, is particularly well-suited to the demands of speech editing.

MAM defines the problem as an “infilling” task. During training, sections of the input audio are deliberately removed (masked), and the model is tasked with predicting or reconstructing this missing audio content based on the surrounding, unmasked audio. This directly mirrors the core requirement of speech editing: to generate new audio for a specific segment that is highly coherent with the existing, unaltered portions of the original recording. By training on a vast amount of data in this manner, the model learns not only to generate speech but also to generate speech that seamlessly integrates with the acoustic properties (such as timbre, pitch, and background noise) and prosodic features (like tempo and intonation) of the provided context. This makes MAM an effective problem definition because it inherently teaches the model to maintain the crucial coherence and naturalness needed when modifying only parts of an audio track, rather than generating it entirely in isolation.
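A minimal sketch of this training setup in PyTorch, assuming a hypothetical `model` that takes a partially zeroed spectrogram plus frame-aligned phonemes and predicts the full spectrogram (not Loom's actual training code):

```python
import torch
import torch.nn.functional as F

def mam_training_step(model, mel, phonemes, mask_frac=0.3):
    """One illustrative Masked Acoustic Modeling step.
    mel: (B, T, n_mels) spectrogram frames; phonemes: frame-aligned IDs, shape (B, T)."""
    B, T, _ = mel.shape
    span = max(1, int(T * mask_frac))
    start = torch.randint(0, T - span + 1, (1,)).item()

    mask = torch.zeros(B, T, dtype=torch.bool)
    mask[:, start:start + span] = True            # mask a contiguous span of frames

    masked_mel = mel.clone()
    masked_mel[mask] = 0.0                        # zero-fill the masked region

    pred = model(masked_mel, phonemes)            # reconstruct the full spectrogram
    # Loss only on the masked frames: the surrounding context is given, not learned.
    return F.l1_loss(pred[mask], mel[mask])
```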

Speech Editing: End-to-End Workflow

The speech editing workflow, designed to support the acoustic model, involves several interconnected stages and auxiliary components to ensure seamless application in a production environment:

1. Waveform to MelSpectrogram Conversion

The workflow begins by processing the Original Audio Waveform. The encoder component of a vocoder transforms the raw, one-dimensional audio waveform into an intermediate, more compact representation: a MelSpectrogram. This signal-processing step converts the speech from the time domain to the frequency domain, representing it as a two-dimensional “image” where one axis represents time, the other represents frequency, and the intensity corresponds to amplitude. This conversion reframes the audio editing task as an image editing problem, specifically “image infilling” (inpainting), analogous to techniques used in image-generation models like Stable Diffusion.
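A minimal sketch of this conversion with torchaudio (the parameter values, such as hop length and number of mel bins, are illustrative rather than Loom's production configuration):

```python
import torch
import torchaudio

waveform, sr = torchaudio.load("original.wav")            # (channels, samples)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,
    hop_length=256,    # roughly 10 ms per spectrogram column at 24 kHz
    n_mels=80,
)
mel = mel_transform(waveform)                             # (channels, n_mels, frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))           # log-compress amplitudes
```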

2. Transcripts Processing

Both the Original Transcript and the Modified Transcript must be prepared for the speech models. This involves two sub-steps:

a. Text Normalization

Transcripts are often in written form rather than spoken form, which can complicate downstream tasks like phonemization and TTS. We normalize input transcripts to improve TTS accuracy. Examples include:

  • Numbers: “12, 13, 43” → “twelve, thirteen, forty-three”
  • Currency: “It costs $34.98” → “It costs thirty-four dollars and ninety-eight cents.”
  • Abbreviations: “St. Patrick’s Day” → “Saint Patrick’s Day”
  • Dates: “Today is Jan. 01, 2022.” → “Today is January first, twenty twenty-two.”

Written forms can be ambiguous (e.g., “St.” could be “Street” or “Saint”; “2022” could be “two thousand and twenty-two” or “twenty twenty-two”). We utilize the NeMo-text-processing library, which enables context-aware processing to minimize such ambiguities. While perfection is unattainable, the level of ambiguity is generally manageable.
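A minimal usage sketch of the NeMo-text-processing normalizer (the exact spoken-form output depends on the library version and its context rules):

```python
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(input_case="cased", lang="en")
spoken = normalizer.normalize("Today is Jan. 01, 2022. It costs $34.98.", verbose=False)
# Expands the date and currency into spoken form, roughly:
# "Today is january first twenty twenty two. It costs thirty four dollars and ninety eight cents."
```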

b. Phonemization

After text normalization, the spoken-form English sentences are converted into the International Phonetic Alphabet (IPA). This helps the TTS system infer the pronunciation of words (e.g., “Atlassian” → ætlˈæsiən, “Confluence” → kˈɑːnfluːəns).

We utilize a phonemizer that uses Espeak as its backend. Espeak features a robust rule-based phonemizer that, anecdotally, surpasses many table-based and ML-based phonemizers.
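A minimal sketch using the phonemizer package with its espeak backend:

```python
from phonemizer import phonemize

ipa = phonemize(
    "I lead the relationship between Atlassian and Loom.",
    language="en-us",
    backend="espeak",
    strip=True,
)
# Returns an IPA string for the sentence, e.g. containing ætlˈæsiən for "Atlassian".
print(ipa)
```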

However, several challenges remain with this approach.

  • Accents: Ambiguities can arise due to different accents (e.g., “Dance”: dˈæns in US English vs. dˈans in British English; “Can’t”: kˈænt in US English vs. kˈɑːnt in British English). Accents from regions like South Africa, India, and Singapore add further complexity. Without prior knowledge of users’ accents, we can only rely on the acoustic model to predict the accents and generate the correct vocals.
  • Names: Rule-based models may struggle with names, especially those of non-Western origin. To mitigate this, we maintain a custom phonetic spelling lookup table, creating specialized rules for specific names.

Upon completion of these sub-steps, we have two sequences of phonemes: one for the original audio and one for the desired modified audio.

3. Forced Alignment

This stage aligns the acoustic features of the original audio (represented by the MelSpectrogram from Step 1) with its textual representation (the phoneme sequence from Step 2b). A Forced Aligner determines when each phoneme in the original transcript occurs in the audio. Without an accurate alignment, subsequent editing and masking processes will not target the correct audio segments and therefore will not work effectively.

The Forced Aligner is an ML model that maps each phoneme to its corresponding segment (time frames or “columns”) in the MelSpectrogram, establishing precise start and end times. The primary output is a set of durations for each phoneme.

For instance, it might determine that the phoneme /iː/ in the word “thirteen” corresponds to 26 columns (time frames) in the spectrogram, where each column typically represents a small unit of time (e.g., approximately 10 milliseconds). This phoneme-level timing information is vital for knowing which audio segments to edit.

Visualization of an aligned phoneme sequence
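A small sketch of how per-phoneme frame durations translate into time segments, assuming roughly 10 ms per spectrogram column:

```python
FRAME_MS = 10  # approximate duration of one spectrogram column

def durations_to_segments(phonemes, durations):
    """phonemes: list of IPA symbols; durations: frames per phoneme (same length)."""
    segments, frame = [], 0
    for ph, frames in zip(phonemes, durations):
        segments.append((ph, frame * FRAME_MS, (frame + frames) * FRAME_MS))
        frame += frames
    return segments  # [(phoneme, start_ms, end_ms), ...]

# e.g. a phoneme aligned to 26 columns spans roughly 260 ms of audio.
```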

The Forced Aligner
Traditional forced aligners (e.g., HTK, Montreal Forced Aligner) are often HMM-based and suited for offline batch processing. While these aligners demonstrate satisfactory performance and have been widely adopted by researchers, they are not ideal for real-time production usage.
Our initial attempt to build a forced aligner using Wav2Vec pre-trained models also yielded unsatisfactory performance and accuracy.
Ultimately, we developed and trained our own forced aligner based on the “One TTS Alignment to Rule Them All” paper from NVIDIA. We also incorporated a few key modifications to the original model architecture:
  • The original paper includes a static “prior” term intended to favor diagonal alignments and speed up convergence. In practice, it proved problematic: for recordings with long silences, it could actually slow convergence, so we removed this term.
  • We replaced the convolutional network component of the aligner with a transformer architecture, inspired by the Vision Transformer (ViT), resulting in faster convergence and improved accuracy.
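For intuition, the aligner's training objective in the NVIDIA paper is a forward-sum loss over a soft text-to-frame alignment matrix, commonly implemented via CTC; the sketch below follows that common implementation and is not Loom's exact code:

```python
import torch
import torch.nn.functional as F

def forward_sum_loss(attn_score, text_lens, mel_lens):
    """attn_score: (B, T_mel, T_text) similarity scores between mel frames and phonemes
    (e.g., negative L2 distances); text_lens / mel_lens: lists of valid lengths."""
    total = 0.0
    for b in range(len(text_lens)):
        scores = attn_score[b, : mel_lens[b], : text_lens[b]]        # (T, N)
        scores = F.pad(scores, (1, 0), value=-1e4)                   # add a "blank" column
        log_probs = F.log_softmax(scores, dim=-1).unsqueeze(1)       # (T, 1, N + 1)
        targets = torch.arange(1, text_lens[b] + 1).unsqueeze(0)     # monotonic targets 1..N
        total = total + F.ctc_loss(
            log_probs, targets,
            input_lengths=torch.tensor([mel_lens[b]]),
            target_lengths=torch.tensor([text_lens[b]]),
            blank=0, zero_infinity=True,
        )
    return total / len(text_lens)
```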

This alignment information is set aside for constructing the “masked spectrogram condition” in Step 6.

4. Sequence Matching

To precisely identify the edits needed, the phoneme sequence from the original transcript is compared against the sequence from the modified transcript (both processed in Step 2) using a sequence matcher.

This process is analogous to a “diff” operation in code version control. It identifies the differences between the two phoneme sequences, categorizing each change as:

  • Unchanged: Segments identical in both transcripts.
  • Inserted: New words/phrases in the modified transcript not present in the original.
  • Deleted: Words/phrases in the original but removed in the modified transcript.
  • Replaced: Original words/phrases deleted and new ones inserted in their place.

The output, detailing the type and location of each edit, serves as a blueprint for the subsequent construction of a masked MelSpectrogram. However, there’s one more piece of information needed before we can proceed.
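A minimal sketch of this step using Python's difflib, whose opcodes map directly onto the unchanged / inserted / deleted / replaced categories above (the toy phoneme sequences are purely illustrative):

```python
from difflib import SequenceMatcher

original = ["h", "eɪ", "m", "ɑː", "ɹ", "k"]      # toy phonemes for "Hey, Mark"
modified = ["h", "eɪ", "æ", "ʃ", "w", "ɪ", "n"]  # toy phonemes for "Hey, Ashwin"

for tag, i1, i2, j1, j2 in SequenceMatcher(a=original, b=modified).get_opcodes():
    # tag is one of "equal", "insert", "delete", "replace"
    print(tag, original[i1:i2], "->", modified[j1:j2])
```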

5. Duration Prediction

When words or phrases are inserted or replaced, the spoken duration of the new phonemes is unknown. A Duration Predictor model estimates these durations. For example, “fourteen” and “fifteen” have different phoneme durations. The model takes the phoneme sequence (with known durations for unchanged parts, as determined by forced alignment in Step 3, and masked or unknown durations for new or changed parts) and predicts missing durations, considering the context of surrounding phonemes.

The Voicebox paper proposed two implementations: one using Continuous Normalizing Flows (CNF) and another simpler regression model akin to FastSpeech2. We opted for the FastSpeech2 style regression-based duration predictor model for its simplicity and fast inference.
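A minimal FastSpeech2-style regression sketch in PyTorch (hyperparameters and layer choices are illustrative):

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Regresses a log-duration (in frames) for each phoneme embedding."""
    def __init__(self, dim=256, kernel=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)
        self.proj = nn.Linear(dim, 1)

    def forward(self, phoneme_emb):                                     # (B, T, dim)
        x = torch.relu(self.conv1(phoneme_emb.transpose(1, 2))).transpose(1, 2)
        x = self.drop(self.norm1(x))
        x = torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.drop(self.norm2(x))
        return self.proj(x).squeeze(-1)                                 # (B, T) log-durations
```

At edit time, the predictions are only used for the masked (new or changed) phonemes; unchanged phonemes keep the durations measured by the forced aligner.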

6. Constructing the Masked MelSpectrogram

With alignment, sequence matching, and duration prediction complete, all the necessary information is in place to construct the “masked MelSpectrogram,” which will serve as the input to the acoustic model:

  • Unchanged Segments: The corresponding segments from the original MelSpectrogram are directly copied (or “grafted”) into the new MelSpectrogram.
  • Deleted Segments: The corresponding segments in the original MelSpectrogram are omitted from the new MelSpectrogram. This results in a shorter MelSpectrogram.
  • Inserted Segments: A new, “masked” (initially zero-filled) segment is added. Its length is determined by the predicted durations of the new phonemes (from Step 5), ensuring adequate space for generation.
  • Replaced Segments: This combines deletion and insertion. The original segment in the MelSpectrogram is omitted, and a new masked segment is inserted in its place. The predicted duration determines the length of this new segment.

The resulting masked MelSpectrogram is a composite of the preserved original audio, adjusted for deletions, and masked areas for new or replaced content. This, along with the target phonetic information for the masked regions, is then passed to the acoustic model.
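A sketch of the assembly logic, reusing difflib-style opcodes from Step 4 (variable names and shapes are illustrative, not Loom's code):

```python
import torch

def build_masked_mel(orig_mel, opcodes, phoneme_frames, pred_durations):
    """orig_mel: (T, n_mels); opcodes: (tag, i1, i2, j1, j2) over phoneme indices;
    phoneme_frames: (start_frame, end_frame) per original phoneme (from Step 3);
    pred_durations: predicted frame counts per modified-transcript phoneme (Step 5)."""
    pieces, mask = [], []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            start, end = phoneme_frames[i1][0], phoneme_frames[i2 - 1][1]
            pieces.append(orig_mel[start:end])                           # graft original frames
            mask.append(torch.zeros(end - start, dtype=torch.bool))
        elif tag in ("insert", "replace"):
            new_len = int(sum(pred_durations[j1:j2]))                    # room for new content
            pieces.append(torch.zeros(new_len, orig_mel.shape[1]))
            mask.append(torch.ones(new_len, dtype=torch.bool))
        # "delete": the original frames are simply omitted
    return torch.cat(pieces), torch.cat(mask)                            # masked mel + infill mask
```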

7. Infilling the Masked Spectrogram

This stage is where the new audio content is generated. The masked spectrogram and phoneme information are fed into the acoustic model.

The acoustic model uses a Continuous Normalizing Flow (CNF) model trained with flow matching. Flow matching is a simulation-free approach that regresses vector fields of conditional probability paths, enabling efficient CNF training by learning to transform a simple noise distribution to the complex data distribution of speech.

The process starts with random noise in the masked sections. Flow matching trains the model to predict vector fields at each timestep, conditioned on aligned phonemes (target sounds) and surrounding speech context. An Ordinary Differential Equation (ODE) solver then integrates these vector fields to generate the final audio. This method typically requires fewer inference steps than traditional diffusion models.
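A minimal sketch of the inference loop with a plain Euler solver (the `model` call signature, the number of steps, and the conditioning details are assumptions for illustration):

```python
import torch

@torch.no_grad()
def infill(model, masked_mel, mask, phonemes, steps=8):
    """masked_mel: (T, n_mels) with zeros in the masked region; mask: (T,) bool."""
    x = masked_mel.clone()
    x[mask] = torch.randn(int(mask.sum()), masked_mel.shape[1])   # start the edit region from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        v = model(x, t, phonemes, masked_mel, mask)               # predicted vector field at time t
        x = x + dt * v                                            # Euler ODE integration step
        x[~mask] = masked_mel[~mask]                              # keep the original context fixed
    return x
```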

The acoustic model “infills” these masked sections, generating new audio matching the target phonemes while maintaining consistency in vocal style, tempo, and speaker characteristics with the surrounding original audio. The output is a complete, new MelSpectrogram with the edits seamlessly incorporated. The following table illustrates the evolution of the MelSpectrogram over eight timesteps, from random noise to speech.

8. Waveform Reconstruction (Vocoder Decode)

The final step converts the generated mel spectrogram back into a one-dimensional audio waveform using a vocoder decoder. This reverses the initial encoding process (Step 1), often employing a neural network to predict and fill in any missing information, thereby accurately reconstructing the waveform.

We’ve experimented with different vocoders. Initially, ParallelWaveGAN was used, and later we transitioned to Vocos and BigVGAN for improved performance. Practically, we found the choice of vocoder had little impact on the final output quality.
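A minimal decoding sketch with the open-source Vocos package (the pretrained checkpoint name follows the project's published example and may differ from what runs in production):

```python
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
mel = torch.randn(1, 100, 256)   # (batch, n_mels, frames); this checkpoint expects 100 mel bins
audio = vocos.decode(mel)        # (batch, samples) waveform at 24 kHz
```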

Safeguards

Speech editing systems can pose a significant ethical risk if misused. The core risk is that someone could use this technology to impersonate another person’s voice without their knowledge or consent. To mitigate some of the risks, we implemented various product-level and data-level safeguards.

Product-Level Restrictions

  • Speech editing is only available to the creator of a Loom video, even if others have edit rights. This prevents others from using the feature to edit someone else’s speech.
  • Speech editing is not available on uploaded videos or meeting recordings, to avoid editing speech that does not belong to the creator.
  • Explicit User Acknowledgement: When using speech editing, users must confirm that the voice being altered is their own.

Data Use and Retention

  • No user data or user-generated content (UGC) is used to train these AI models for speech editing. The current models are trained solely on publicly available open datasets.
  • No personal voice data is stored. The system doesn’t save or retain any voice data that could identify users as specific individuals, and the model cannot recreate a user’s voice without a reference speech sample.

Usage Tracking

  • Editing logs for text-to-speech are kept for potential future audits.
  • Responsible technology reviews and user research are performed to identify and address emerging risks as the technology develops.

Afterword

The development of Loom’s speech editing technology reflects a commitment to pushing the boundaries of video communication, making it more fluid and less constrained by the finality of a single recording session. By architecting a system that can intelligently and coherently modify spoken audio, we aim to empower users, giving them greater flexibility and control over their content long after the initial recording.

As AI continues to evolve, the principles guiding this work—prioritizing user experience, ensuring scalability, and building for seamless integration—will remain central to its development. The journey of refining and expanding these capabilities is ongoing, with the ultimate goal of making video an even more powerful and adaptable medium for connection and expression.
