Voice Changer + Runway Act-One: Full Workflow

Learn how to combine a real-time voice changer with Runway ML Act-One to produce character-driven AI short films with perfectly matched voice and performance.

Runway ML’s Act-One feature changed what solo creators can achieve. Record yourself acting a scene — just a phone camera and natural light — and Act-One maps your facial performance onto any character in a generated video. The missing piece for most indie filmmakers is audio: Act-One handles the face, but the voice that comes out of your mouth still sounds like you.

A real-time voice changer closes that gap. Record your reference video with the voice already transformed, and the output clip ships with a character voice baked in — no post-processing, no overdub session.

This guide walks the complete workflow: choosing presets by character archetype, setting up the audio chain so Runway captures cleanly, and assembling everything in a video editor for distribution.


TL;DR

  • Runway Act-One reads facial motion from a reference video and maps it to a generated character.
  • A real-time voice changer running through a virtual microphone lets you record the reference video with character audio already applied.
  • The audio track from your reference recording becomes the final dialogue — Act-One does not touch audio.
  • Match your voice preset to your character archetype before you hit record.
  • VoxBooster’s WASAPI virtual mic is recognized by OBS, webcam software, and screen recorders without driver installation.
  • Final assembly is straightforward: import the Act-One video output, sync the processed audio track, color grade, and export.

What Is Runway Act-One?

Runway ML is a generative AI platform used by filmmakers, VFX studios, and content creators for video generation and editing tasks. Act-One is a specific feature that performs facial motion transfer: it analyzes a reference video of a human performer and drives the facial animation of a character in a generated output clip.

The workflow differs from pure text-to-video. Instead of describing movement in a prompt, you embody it. Your eyebrow raises, lip sync, and head tilts become the character’s expressions. This produces significantly more natural and emotionally coherent animation than prompt-only generation, because the source of truth is real human performance data.

Act-One joins a broader set of tools — including Runway Gen-4, green screen tools, and in-painting — that together function as a complete production pipeline for AI-assisted film.


Why Audio Is the Overlooked Layer

When creators first try Act-One, the usual result is visually impressive but aurally jarring. The character’s face moves with the actor’s expressiveness, but the voice is recorded raw — natural human timbre, no transformation — and pasted under the generated footage. The disconnect is immediate.

The conventional fix is post-production voice processing: record clean, then run the audio through effects afterward. This works, but it creates a synchronization problem. Lip sync in Act-One depends on the reference video. If you record a subtle performance and then add heavy vocal processing afterward — extending vowels, adding formant shift — the mouth movement on the character no longer matches the processed audio.

Recording with the voice changer applied in real time solves this. You hear the transformed voice in your headphones while performing, which naturally shapes your mouth movements and pacing to match the processed audio. Act-One captures those adjusted movements. The result is tighter lip sync in the generated output.


How Runway Act-One Reads the Reference Video

Understanding the input format helps you record better reference footage.

Act-One performs face-tracking on the reference clip. It expects:

  • Frontal or near-frontal angle — profiles reduce accuracy significantly. Aim for your face centered in frame, camera at eye level.
  • Consistent lighting — harsh shadows across the nose or eyes interfere with landmark detection. Soft frontal light (ring light, window light) is ideal.
  • Minimal background motion — people walking behind you or moving objects can confuse the tracker.
  • Clear lip visibility — beards and microphones in front of the mouth reduce lip-sync fidelity.
  • 720p or higher, 24fps or 30fps — lower resolution reduces tracking precision.
  • MP4 container — most reliable for the upload pipeline. MOV also works.
  • Under 30 seconds per take — Act-One processes efficiently at this length; longer clips are possible but increase generation queue time.
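The recommendations above can be turned into a quick pre-upload check. This is a minimal sketch, assuming you have already read the clip's metadata with a probe tool of your choice. `ClipInfo` and `check_reference_clip` are hypothetical names, and the thresholds mirror this guide's recommendations rather than any limit documented by Runway.

```python
from dataclasses import dataclass


@dataclass
class ClipInfo:
    """Metadata for one reference take (hypothetical container, filled in by hand or from a probe tool)."""
    container: str     # e.g. "mp4" or "mov"
    height: int        # vertical resolution in pixels
    fps: float         # frames per second
    duration_s: float  # clip length in seconds


def check_reference_clip(clip: ClipInfo) -> list[str]:
    """Return a list of warnings; an empty list means the clip matches the guide's recommendations."""
    warnings = []
    if clip.container.lower() not in ("mp4", "mov"):
        warnings.append(f"container '{clip.container}' is not MP4 or MOV")
    if clip.height < 720:
        warnings.append(f"height {clip.height}px is below 720p")
    # round() lets 23.976 and 29.97 count as 24 and 30 fps
    if round(clip.fps) not in (24, 30):
        warnings.append(f"frame rate {clip.fps} is not 24 or 30 fps")
    if clip.duration_s >= 30:
        warnings.append(f"duration {clip.duration_s:.1f}s is not under 30 seconds")
    return warnings


# Example: a 1080p, 30 fps, 18-second MP4 take
for warning in check_reference_clip(ClipInfo("mp4", 1080, 30.0, 18.0)):
    print("WARNING:", warning)
```

The check covers only the measurable items in the list; framing, lighting, and lip visibility still need a visual review of the take.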

The audio track in the reference video is not analyzed by Act-One itself. The generation is driven purely by visual data. This means the voice changer output in your audio track has zero effect on the facial animation quality — the two layers are completely independent.


Character Archetypes and Voice Preset Pairing

The strongest Act-One films have sonic coherence: the voice fits the character before a single line of dialogue is written. Here is a practical pairing guide.

Character archetype | Recommended voice treatment | Notes
Armored warrior / knight | Pitch down 3-5 semitones + mild room reverb | Adds weight; reverb simulates helmet resonance
Supernatural / ethereal being | Slow pitch modulation + formant up | Creates an unsettled, otherworldly texture
Robot / AI construct | Hard vocoder or bit-crush preset | Works best with crisp, deliberate delivery
Ancient evil / villain | Heavy pitch down + subtle chorus | Chorus adds the sense of multiple voices
Young hero / chosen one | Slight pitch up + minimal processing | Preserve emotional range; don’t over-process
Alien diplomat | Formant shift + light stereo width | Keeps speech intelligible while sounding non-human
Narrator / oracle | Pitch down 2 semitones + long reverb tail | Epic documentary energy

The table is a starting point, not a rulebook. Blend presets and trust your ear during the performance. If the voice feels right through your headphones while you are acting, it will feel right in the final film.
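To get a feel for what a semitone value in the table means, the standard equal-temperament relationship is that a shift of n semitones multiplies frequency by 2^(n/12). The sketch below applies that formula. The 120 Hz fundamental is an assumed example value, not a measurement, and the function names are illustrative.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift of the given number of semitones (12-tone equal temperament)."""
    return 2.0 ** (semitones / 12.0)


def shifted_frequency(base_hz: float, semitones: float) -> float:
    """Frequency after shifting base_hz up (positive) or down (negative) by some semitones."""
    return base_hz * semitones_to_ratio(semitones)


# Assumed example: a speaking voice with a 120 Hz fundamental
for shift in (-5, -3, -2):
    print(f"{shift:+d} semitones -> {shifted_frequency(120.0, shift):.1f} Hz")
```

A shift of 12 semitones doubles or halves the frequency, so the 2 to 5 semitone range in the table moves the voice noticeably while staying well short of an octave.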


Setting Up the Audio Chain

The goal is to route processed audio into both your recording software (for the reference video audio track) and your monitoring headphones (so you hear yourself in character while performing).

Step 1 — Install and configure the voice changer

Install VoxBooster on Windows 10 or 11. No kernel driver is required — the WASAPI virtual microphone appears in Windows sound settings as a standard input device within seconds of first launch.

Open VoxBooster, select your physical microphone as the input source, and choose a preset from the archetype table above. Verify the output is routing to VoxBooster Virtual Mic in the output selector.

Step 2 — Set monitoring

In VoxBooster’s settings, enable headphone monitoring. You should now hear your transformed voice in real time through your headphones. Latency for DSP presets is under 20ms — imperceptible during performance. AI voice cloning mode adds a brief processing window (under 300ms end-to-end), which some performers find slightly disorienting at first; rehearse a few lines before the take.

Step 3 — Configure the recording software

Open your screen recorder or webcam capture app (OBS, Windows Camera, Loom, or similar). In the audio input settings, select VoxBooster Virtual Mic instead of your physical microphone. This ensures the recording captures the processed voice, not the raw input.

If you are using OBS:

  1. In Sources, add an Audio Input Capture source.
  2. In the source properties, select VoxBooster Virtual Mic from the device dropdown.
  3. Add a Video Capture Device source pointed at your webcam.
  4. Start recording. Both streams write to the same output file.

Step 4 — Record the reference take

Keep the take short — 10 to 25 seconds is the sweet spot for Act-One. Perform naturally, maintaining eye contact with the camera lens. Speak the dialogue aloud with full commitment to the character; Act-One reads emotional intensity through your facial muscle movement.

After recording, verify the output file: the audio track should contain the processed voice, not the raw microphone feed. Play the file back in a media player before uploading to Runway.


Uploading to Runway Act-One and Generating Output

Log into your Runway account and navigate to the Act-One feature. The interface asks for two inputs:

  1. Reference video — your recorded performance clip with processed audio.
  2. Character source — either a generated image from Gen-4, an uploaded character render, or a prior generation output.

Upload the reference video. Act-One extracts the facial motion data during its analysis pass. Then select or generate your character. Configure generation settings (aspect ratio, style guide, any prompt guidance for the scene environment).

Submit the generation. Queue times vary by plan and platform load. While waiting, you can prepare post-production assets: any scene background elements, title cards, or music tracks.

When the output clip downloads, it contains the character video driven by your performance. The audio track in the downloaded file may be silent or may carry through your reference audio depending on Runway’s pipeline version. In either case, your next step is the video editor, where you will assemble the final composite.


Post-Production Assembly

Open your video editor (DaVinci Resolve, Premiere Pro, CapCut, or any NLE). Create a new project matching your target output specs (typically 1920×1080 for horizontal or 1080×1920 for vertical, at 24fps).

Track layout:

Track | Content
V1 | Act-One generated character video
V2 | Background plates or environment footage
A1 | Processed audio from reference recording
A2 | Music / ambient sound
A3 | Optional SFX layers

Sync the processed audio from your reference recording to the character video on V1. Because you recorded audio and video simultaneously in the reference take, the sync is already baked in — you should not need to adjust it manually unless the upload pipeline trimmed a few frames.
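If the pipeline did trim a few frames, the size of the nudge is simple arithmetic: frames divided by frame rate. A small sketch, with an illustrative function name and assumed example values:

```python
def frame_offset_ms(frames: int, fps: float) -> float:
    """Convert a frame count into a time offset in milliseconds at the given frame rate."""
    return frames * 1000.0 / fps


# Assumed example: 3 frames trimmed from the head of a 24 fps clip
print(f"Slide the audio by {frame_offset_ms(3, 24.0):.1f} ms")
```

Most editors also let you slip a clip by whole frames directly, which avoids the conversion when the audio and video share a timeline frame rate.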

Add background plates, color grade the character clip to match, and mix the audio. Export as H.264 or H.265 for upload to YouTube, TikTok, or Instagram.


Common Problems and Fixes

Act-One output has stiff or uncanny facial motion

Usually caused by tracking issues in the reference video. Check lighting uniformity and ensure no strong shadows cross the face. Re-record with a softer light source.

Lip sync drifts in the generated video

Confirm that your reference audio and video were recorded simultaneously and in sync before upload. A drift in the source file will amplify in the output. If you recorded audio separately and merged it, ensure the merge was frame-accurate.

Voice changer adds noticeable latency during performance

DSP presets run under 20ms and are essentially imperceptible. If you notice delay, check whether your audio interface’s buffer size is set too high, and reduce the WASAPI buffer in your recording software to 128 or 256 samples.
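To see why the buffer size matters, it helps to convert samples into time: one buffer covers buffer size divided by sample rate. The sketch below assumes a 48 kHz sample rate, which is common but not something this guide specifies, and it accounts for a single buffer only, not the voice changer's own processing time.

```python
def buffer_latency_ms(buffer_samples: int, sample_rate_hz: int = 48_000) -> float:
    """Time covered by one audio buffer, in milliseconds."""
    return buffer_samples / sample_rate_hz * 1000.0


# Compare the suggested sizes with a larger buffer
for size in (128, 256, 1024):
    print(f"{size} samples -> {buffer_latency_ms(size):.2f} ms")
```

At the assumed rate, 128 and 256 samples each add only a few milliseconds, while a 1024-sample buffer alone exceeds the 20ms figure quoted for DSP presets.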

The processed voice sounds over-compressed or distorted in the final clip

Your voice changer gain staging may be too hot. Lower the output level in VoxBooster until the signal peaks around -6 dBFS. This leaves headroom for the video editor’s audio processing.
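Peak level in dBFS is 20·log10 of the largest absolute sample value, where full scale is 1.0. The following sketch measures the peak of a list of normalized samples and reports how many dB of gain change would bring it to the -6 dBFS target. The function names and the sample values are illustrative assumptions.

```python
import math


def peak_dbfs(samples: list[float]) -> float:
    """Peak level in dBFS for samples normalized to the range -1.0..1.0."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return float("-inf")  # digital silence has no finite dBFS value
    return 20.0 * math.log10(peak)


def gain_to_target_db(samples: list[float], target_dbfs: float = -6.0) -> float:
    """Gain change in dB (negative means turn down) needed to reach the target peak."""
    return target_dbfs - peak_dbfs(samples)


# Assumed example: a take that peaks at 0.9 of full scale
take = [0.1, -0.45, 0.9, -0.3]
print(f"peak: {peak_dbfs(take):.2f} dBFS, adjust by {gain_to_target_db(take):.2f} dB")
```

This looks at sample peaks only; it says nothing about perceived loudness, which the mix stage in the video editor handles.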

Act-One does not accept the uploaded reference video

Ensure the file is MP4 (H.264), resolution is at least 720p, and duration is under the documented limit for your Runway plan. Re-encode with HandBrake if the original capture software produced an unusual container.
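The paragraph above suggests HandBrake. As an alternative that is easy to show in text, the sketch below swaps in the ffmpeg command-line tool and builds a re-encode command for H.264 video and AAC audio in an MP4 file. It only assembles and prints the command, it does not run it, and the file names are placeholders.

```python
import shlex


def build_reencode_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that re-encodes src to H.264 video and AAC audio in an MP4 file."""
    return [
        "ffmpeg",
        "-i", src,                  # input file from the capture software
        "-c:v", "libx264",          # H.264 video encoder
        "-pix_fmt", "yuv420p",      # widely compatible pixel format
        "-c:a", "aac",              # AAC audio, keeps the processed voice track
        "-movflags", "+faststart",  # move the index to the start of the file
        dst,
    ]


# Placeholder file names for illustration
cmd = build_reencode_command("take_01.mkv", "take_01.mp4")
print(shlex.join(cmd))
```

Play the re-encoded file back before uploading to confirm the audio track still carries the processed voice.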


Full Production Checklist

Use this checklist per scene before uploading to Runway.

  • Preset chosen and rehearsed in character
  • Headphone monitoring confirmed (hearing transformed voice)
  • Recording software set to VoxBooster Virtual Mic input
  • Lighting checked — even, frontal, no strong shadows on face
  • Background clear — no moving objects
  • Test take recorded and played back — audio is processed, not raw
  • Take duration under 30 seconds
  • File exported as MP4 H.264, 720p minimum
  • File plays correctly in media player before Runway upload

Scaling to a Multi-Scene Short Film

Indie AI filmmakers often hit the same wall: the first test clip looks great, but producing a coherent 3-to-5 minute short requires consistency across many clips. A few practices help.

Character voice consistency — save your preset configuration before you start production. Every take for the same character uses the identical preset and gain settings. Even small changes in pitch shift amount will be noticeable across cuts.

Reference video consistency — use the same camera position, lens, and lighting setup for every take featuring the same character. Act-One will produce more coherent facial style across the generated clips.

Batch processing — record all takes in a single session if possible. Consistent acoustic environment (same room, same microphone position) keeps the processed audio tonally uniform.

Audio mixing — because all dialogue was processed with the same preset, EQ and compression settings only need to be set once on the A1 bus and applied uniformly to all scenes.

Runway’s own documentation and community showcase (runwayml.com) contain examples of extended Act-One projects for reference. Runway as a company is also covered in detail on Wikipedia, including its development history and the research context behind the motion-transfer techniques used in Act-One.


Why Voice Changer Quality Matters for Act-One Work

Act-One elevates indie film production to a level where audio quality becomes the bottleneck. Generated character video at this fidelity deserves an audio track that matches. Basic pitch-shift plugins produce metallic artifacts that clash with high-quality visual output. The reference recording is also the final audio track — there is no re-recording session — so the capture quality is permanent.

VoxBooster processes audio at sub-300ms end-to-end for AI voice cloning and under 20ms for DSP presets, which is fast enough for natural performance. The WASAPI virtual microphone is recognized by Windows without driver installation and appears cleanly in OBS, webcam software, and screen recorders. The result is a voice track that holds up alongside the visual output rather than undermining it.

Pricing starts at $6.99/month. A free trial covers a full production test before committing.


FAQ

What is Runway Act-One and how does it use a reference video?

Act-One is a feature inside Runway ML that transfers a human actor’s facial expressions and head movements onto a generated character. You supply a short reference video of yourself performing; Act-One reads your facial motion and maps it to the character. The better the performance, the more expressive the output.

Can I use a voice changer while recording the Act-One reference video?

Yes. Because Act-One only analyzes facial geometry and motion, not audio pitch, you can run a real-time voice changer through a virtual microphone and record both video and processed audio simultaneously. The audio you capture becomes the final dialogue track; Act-One handles the visual side independently.

What voice presets work best for fantasy or sci-fi characters in Act-One?

For armored heroes or warriors, a pitch-down preset with light reverb seats the character in space. For supernatural or ethereal characters, slow pitch modulation or formant-shift creates an otherworldly texture. Robotic presets work for mechs or AI characters. The key is matching the preset’s energy to the character archetype you perform in the reference footage.

Does Runway Act-One require a specific reference video format?

Act-One works best with a well-lit frontal shot, face clearly visible, minimal background clutter. Resolution 720p or higher is recommended. MP4 is the most reliable container. Keep clips under 30 seconds for the initial reference take; you can chain multiple takes for longer scenes.

What is WASAPI and why does it matter for recording voice changer output?

WASAPI (Windows Audio Session API) is the core audio API that Windows applications use to exchange audio with input and output devices. It has been part of Windows since Vista, is present in Windows 10/11, and supports low-latency operation. A voice changer that exposes a WASAPI virtual microphone lets any recording app, including screen recorders and webcam software, capture the processed voice with very little added latency and with no driver installation required.

Do I need a powerful PC to record Act-One reference videos with a real-time voice changer?

A mid-range CPU handles real-time DSP effects at sub-20ms latency without noticeable load. AI voice cloning inference adds GPU load; a dedicated GPU helps but is not mandatory. The reference recording step is typically short (under 30 seconds), so even on modest hardware the performance cost is brief.

Can this workflow be used for long-form AI films or only short clips?

Act-One is optimized for short-to-medium clips, and Runway’s generation queue favors clips under one minute. For longer films, the standard approach is scene-by-scene production: record a reference take per scene, generate each output clip, then assemble in a video editor. The voice changer runs once per take and the processed audio is exported with each clip.

Try VoxBooster — 3-day free trial.
