Your first Synthesia video sets the pattern for every video after it. This is the structure specialists use so video #1 looks like video #100.
Who this is for
New Synthesia users producing their first video.
Marketing teams who want to skip the "looks like a hostage video" learning phase.
Anyone who wants the production-quality version, not the trial-output version.
What you'll need
Step 1
Scripts written in the editor are 30% longer and 50% less focused than scripts written in a doc first. Write outside, paste in.
Open Google Docs, Notion, or any text tool.
Write 150-200 words for a 60-90 second video. Synthesia avatars read at ~150 words/min.
Read it out loud. Cut any sentence over 20 words. Cut any paragraph over 3 sentences.
Write for the ear, not the eye. Conversational rhythm beats written-prose rhythm every time.
Mark intended pauses with [pause] — Synthesia respects these as natural silence.
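The length and read-aloud rules above are easy to check mechanically before pasting the script in. The following is a minimal sketch, not a Synthesia feature: the function names are hypothetical, the sentence splitter is deliberately naive, and the 150 words/min rate and the 20-word and 3-sentence limits are taken from this tutorial.

```python
import re

WORDS_PER_MINUTE = 150        # avatar reading rate quoted in this tutorial
MAX_SENTENCE_WORDS = 20       # "cut any sentence over 20 words"
MAX_PARAGRAPH_SENTENCES = 3   # "cut any paragraph over 3 sentences"


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def count_words(text):
    """Count spoken words, ignoring [pause] markers."""
    return len(re.sub(r"\[pause\]", " ", text).split())


def check_script(script, speed=1.0):
    """Return (estimated seconds, list of rule violations) for a script.

    Paragraphs are separated by blank lines. `speed` is a fraction of the
    default reading rate (0.9 means 90% speed).
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", script) if p.strip()]
    problems = []
    for p_index, paragraph in enumerate(paragraphs, start=1):
        sentences = split_sentences(paragraph)
        if len(sentences) > MAX_PARAGRAPH_SENTENCES:
            problems.append(f"paragraph {p_index}: {len(sentences)} sentences")
        for sentence in sentences:
            n = count_words(sentence)
            if n > MAX_SENTENCE_WORDS:
                problems.append(f"paragraph {p_index}: sentence with {n} words")
    seconds = count_words(script) / (WORDS_PER_MINUTE * speed) * 60
    return seconds, problems
```

Under these assumptions, a 150-word script comes out at 60 seconds at default speed, and a 200-word script at 90% speed comes out just under 90 seconds, which matches the 60-90 second target.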
Step 2
Avatar selection drives 50% of viewer believability. Match avatar tone to script tone.
Synthesia ships 200+ stock avatars across ages, ethnicities, and styles.
For sales outreach: clean, warm, age-matched to your target buyer.
For training: more authoritative, professional dress, neutral background.
For social media: more casual, animated, lifestyle background.
Filter avatars by tone (corporate, casual, friendly) and shot (close-up vs medium).
Preview 5-7 avatars with a 1-line snippet of your script before committing.
Step 3
Voice and avatar should feel like the same person. Mismatched pairing is the #1 reason videos feel uncanny.
For each avatar, Synthesia suggests 2-4 voice matches. Start with those.
Adjust pace: 90% of the default speed sounds more natural for most marketing videos.
Adjust emphasis: highlight 3-5 words per paragraph using the editor controls.
Add SSML if needed: <break time="500ms"/> for pauses, <emphasis> for stress. The Synthesia editor exposes these visually.
Test the first 10 seconds before committing. Whether the voice fits becomes obvious fast.
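If the script was drafted with [pause] markers, the two SSML tags named above can be generated from it rather than typed by hand. This is a rough sketch under stated assumptions: `to_ssml` is a hypothetical helper, it emits only the standard `<break>` and `<emphasis>` elements, and whether a given Synthesia plan accepts pasted markup (rather than the visual controls) is something to verify in the editor.

```python
import re
from xml.sax.saxutils import escape


def to_ssml(script, pause_ms=500, emphasized=()):
    """Turn [pause] markers into <break> tags and wrap chosen words in <emphasis>.

    Returns an SSML fragment (no <speak> wrapper). Every occurrence of each
    word in `emphasized` is wrapped.
    """
    # Escape &, < and > first so the generated fragment stays well-formed.
    text = escape(script)
    for word in emphasized:
        pattern = r"\b" + re.escape(escape(word)) + r"\b"
        text = re.sub(pattern, lambda m: f"<emphasis>{m.group(0)}</emphasis>", text)
    # Replace each marker, normalising the whitespace around it.
    text = re.sub(r"\s*\[pause\]\s*", f' <break time="{pause_ms}ms"/> ', text)
    return text.strip()


print(to_ssml("Meet Acme. [pause] It is free.", emphasized=["free"]))
```

The example line uses a made-up product name. Keeping the emphasized list to the 3-5 words per paragraph suggested above avoids over-stressing the read.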
Step 4
Single-avatar 90-second videos feel monotone. Cut to b-roll, screen recordings, or text slides every 10-15 seconds.
Scene 1 (0-10s): avatar intro with strong hook. Avatar on screen.
Scene 2 (10-25s): cut to screen recording, product shot, or supporting visual. Avatar may stay as a small picture-in-picture.
Scene 3 (25-50s): back to avatar, main content. Add text overlays for key points.
Scene 4 (50-75s): supporting visual or chart. Show what you are saying.
Scene 5 (75-90s): avatar close-up for CTA. Direct eye contact, slower pacing.
Synthesia ships scene templates — use them as starting points, not finished structure.
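The five-scene plan above can be written down as data and checked against the 10-15 second cut guidance. A small sketch, assuming a hypothetical `Scene` record and a 15-second ceiling taken from the upper end of that guidance; none of this is Synthesia's own data model.

```python
from dataclasses import dataclass

MAX_UNBROKEN_SECONDS = 15  # upper end of the "every 10-15 seconds" guidance


@dataclass
class Scene:
    start: int   # seconds from the start of the video
    end: int
    visual: str


# The scene plan from this tutorial.
PLAN = [
    Scene(0, 10, "avatar intro"),
    Scene(10, 25, "screen recording"),
    Scene(25, 50, "avatar with text overlays"),
    Scene(50, 75, "supporting visual or chart"),
    Scene(75, 90, "avatar close-up CTA"),
]


def check_plan(scenes, max_unbroken=MAX_UNBROKEN_SECONDS):
    """Return warnings for timing gaps/overlaps and for scenes longer than the limit."""
    warnings = []
    for previous, current in zip(scenes, scenes[1:]):
        if current.start != previous.end:
            warnings.append(f"gap or overlap at {previous.end}s")
    for scene in scenes:
        length = scene.end - scene.start
        if length > max_unbroken:
            warnings.append(f"{scene.visual}: {length}s without a cut")
    return warnings


for warning in check_plan(PLAN):
    print(warning)
```

Run against the plan as written, the check flags scenes 3 and 4, which each last 25 seconds. That suggests those scenes need a visual change inside them, such as the text overlays or chart the plan already mentions, to stay within the cut rhythm.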
Step 5
85% of social video is watched muted. Subtitles are not optional.
Synthesia auto-generates subtitles. Click "Add subtitles" → review.
Review every line for misspelled product names, acronyms, brand terms. Auto-generation gets these wrong often.
Style: white text with black background or shadow. Readable on bright phone screens.
Preview on mobile dimensions (9:16 or 16:9 zoomed) before final render.
Reading speed: subtitles should match avatar pace. Reset timing if anything reads too fast or too slow.
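One way to spot subtitles that read too fast or too slow is to compute words per minute for each cue and compare it with the avatar's pace. The sketch below is illustrative only: the cue tuples, the function names, and the 25% tolerance are assumptions, and the 150 words/min target is the reading rate quoted earlier in this tutorial.

```python
def cue_wpm(text, start, end):
    """Words per minute for one subtitle cue; start and end are in seconds."""
    if end <= start:
        raise ValueError("cue must end after it starts")
    return len(text.split()) / (end - start) * 60


def flag_cues(cues, target_wpm=150, tolerance=0.25):
    """Return (cue number, wpm) for cues whose pace is off target by more than tolerance."""
    flagged = []
    for index, (start, end, text) in enumerate(cues, start=1):
        wpm = cue_wpm(text, start, end)
        if abs(wpm - target_wpm) > tolerance * target_wpm:
            flagged.append((index, round(wpm)))
    return flagged


# Hypothetical cues: (start seconds, end seconds, subtitle text).
cues = [
    (0.0, 2.0, "Meet the new Acme dashboard"),
    (2.0, 3.0, "It loads every report in under one second flat"),
]
print(flag_cues(cues))
```

Here the second cue packs nine words into one second, so it is flagged for a timing reset or a split into two cues. If the voice runs at 90% speed, passing `target_wpm=135` keeps the comparison consistent.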
Step 6
First render is rarely the final. Plan for 2-3 iterations before shipping.
Hit Render. 1080p typically takes 5-15 minutes depending on length.
Watch the rendered output on the device your audience will use (phone for social, desktop for B2B).
Note timing of any "uncanny" moments — usually mouth shape on specific words. Adjust pacing or rephrase.
Check audio levels, music balance, intro/outro transitions.
Re-render. Most production-quality videos go through 2-3 render iterations before publication.
Common mistakes
Writing the script in the Synthesia editor
What goes wrong: Script becomes longer and less focused because the editor encourages adding to fill scenes. Final video runs 2-3 minutes when 75 seconds would have been better.
How to avoid: Write the script in a doc first. Tight, conversational, read-aloud-tested. Then paste into Synthesia.
Picking an avatar that does not match the script tone
What goes wrong: Avatar feels uncanny. Viewer cannot articulate why but disengages. View-through rates drop 30-50%.
How to avoid: Preview 5-7 avatars with your script snippet before committing. Match warmth, formality, and energy.
No scene changes — 90 seconds of one avatar talking
What goes wrong: Retention drops sharply after 15 seconds. Average view time falls to 30-40% even with a great script.
How to avoid: Cut to supporting visual every 10-15 seconds. Even a static text slide breaks the monotony.
Skipping subtitles or shipping with auto-generated ones unedited
What goes wrong: Muted social viewers cannot follow. Brand terms appear misspelled. Embarrassment compounds.
How to avoid: Always review auto-subtitles line by line. Fix brand terms, style appropriately, time-check on mobile preview.
Rendering once and shipping
What goes wrong: First render has timing issues, voice glitches, or audio balance problems that go unnoticed because you are too close to it. They are obvious to viewers.
How to avoid: Plan for 2-3 render iterations. Walk away between renders to see with fresh eyes.
Recap
Done — what's next
How to set up your Synthesia account the right way
Hand it off
Producing one Synthesia video is a project. Producing them weekly at brand-quality is a job. EverestX video specialists familiar with Synthesia run $400-1,200/mo for ongoing video production at $14-16/hr.
First video: 2-3 hours including learning curve. By video #5: 60-90 minutes. Specialists can produce ongoing video in 45-60 minutes per piece for routine use cases.
Most viewers detect that a video is AI-generated within 5-10 seconds. The strategy is not to hide the AI; it is to make the AI version of the content better than a low-budget real-person version. Used correctly, viewers prefer high-quality AI to bad live video.
Synthesia can work for sales outreach; see the sales outreach tutorial. With personalization (name, company, role overlays) and short scripts (45-60 seconds), Synthesia outreach can outperform live-recorded outreach for cold prospects.
Loom records you talking on camera. Synthesia generates an avatar speaking your script. Loom is faster and more personal. Synthesia scales and supports localization. Different tools for different jobs — many teams use both.