How to create your first Synthesia video with an AI avatar

Your first Synthesia video sets the pattern for every video after it. This is the structure specialists use so video #1 looks like video #100.

~2-4 hr · Beginner · Updated May 26, 2026

Who this is for

New Synthesia users producing their first video. Marketing teams who want to skip the "looks like a hostage video" learning phase. Anyone who wants the production-quality version, not the trial-output version.

What you'll need

Synthesia account on Creator or higher (Starter has too many limits)
Brand kit configured (see the brand-kit tutorial)
A 60-90 second script idea — product feature, welcome, sales outreach, training intro
About 2-3 hours for the first video

Step 1

Write the script before opening the editor

Scripts written in the editor are 30% longer and 50% less focused than scripts written in a doc first. Write outside, paste in.

Open Google Docs, Notion, or any text tool.

Write 150-200 words for a 60-90 second video. Synthesia avatars read at ~150 words/min.

Read it out loud. Cut any sentence over 20 words. Cut any paragraph over 3 sentences.

Write for the ear, not the eye. Conversational rhythm beats written-prose rhythm every time.

Mark intended pauses with [pause] — Synthesia respects these as natural silence.
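The length and read-aloud rules above can be checked mechanically before pasting the script in. The following is a minimal sketch, not a Synthesia feature: the function names are hypothetical, the sentence splitter is deliberately naive, and the estimate uses the ~150 words/min figure quoted above while ignoring the extra time that [pause] markers add.

```python
import re

WORDS_PER_MINUTE = 150  # approximate avatar reading rate quoted in this tutorial


def split_sentences(text):
    """Naive split: a sentence ends at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def count_words(text):
    """Count words, ignoring [pause] markers."""
    return len(re.sub(r"\[pause\]", " ", text).split())


def check_script(script, max_sentence_words=20, max_paragraph_sentences=3):
    """Report word count, estimated duration, and lines that break the rules."""
    paragraphs = [p for p in script.split("\n\n") if p.strip()]
    words = count_words(script)
    report = {
        "words": words,
        "estimated_seconds": round(words / WORDS_PER_MINUTE * 60),
        "long_sentences": [],   # sentences over the word limit
        "long_paragraphs": [],  # 1-based paragraph numbers over the sentence limit
    }
    for number, paragraph in enumerate(paragraphs, start=1):
        sentences = split_sentences(paragraph)
        if len(sentences) > max_paragraph_sentences:
            report["long_paragraphs"].append(number)
        for sentence in sentences:
            if count_words(sentence) > max_sentence_words:
                report["long_sentences"].append(sentence)
    return report
```

Paragraphs are assumed to be separated by blank lines. A script that reports roughly 150-200 words and empty `long_sentences` and `long_paragraphs` lists is in the range this step describes; reading it aloud is still the real test.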

Step 2

Pick the right avatar for the message

Avatar selection drives 50% of viewer believability. Match avatar tone to script tone.

Synthesia ships 200+ stock avatars across ages, ethnicities, styles.

For sales outreach: clean, warm, age-matched to your target buyer.

For training: more authoritative, professional dress, neutral background.

For social media: more casual, animated, lifestyle background.

Filter avatars by tone (corporate, casual, friendly) and shot (close-up vs medium).

Preview 5-7 avatars with a 1-line snippet of your script before committing.

Step 3

Tune the voice to match the avatar

Voice and avatar should feel like the same person. Mismatched pairing is the #1 reason videos feel uncanny.

For each avatar, Synthesia suggests 2-4 voice matches. Start with those.

Adjust pace: 90% of the default speed sounds more natural for most marketing videos.

Adjust emphasis: use the editor controls to stress 3-5 words per paragraph.

Add SSML if needed: <break time="500ms"/> for pauses, <emphasis> for stress. The Synthesia editor exposes these visually.

Test the first 10 seconds before committing. A good or bad voice fit becomes obvious fast.
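If the script was drafted with [pause] markers and you prefer explicit SSML breaks, the conversion is easy to script. This is a sketch under assumptions: `pauses_to_ssml` is a hypothetical helper, the 500 ms default simply mirrors the example above, and whether your Synthesia plan accepts pasted SSML text rather than the visual controls is something to confirm in the editor.

```python
import re
from xml.sax.saxutils import escape


def pauses_to_ssml(script, pause_ms=500):
    """Replace [pause] markers with SSML <break> tags.

    Other text is XML-escaped so characters like & do not break the markup.
    """
    parts = re.split(r"\s*\[pause\]\s*", script)
    tag = f' <break time="{pause_ms}ms"/> '
    return tag.join(escape(part) for part in parts).strip()


print(pauses_to_ssml("Hello. [pause] Welcome & enjoy."))
```

A longer pause before the call to action can be produced by passing a larger `pause_ms` for that part of the script.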

Step 4

Build scene structure for retention

Single-avatar 90-second videos feel monotone. Cut to b-roll, screen recordings, or text slides every 10-15 seconds.

Scene 1 (0-10s): avatar intro with strong hook. Avatar on screen.

Scene 2 (10-25s): cut to screen recording, product shot, or supporting visual. Avatar may stay as a small picture-in-picture.

Scene 3 (25-50s): back to avatar, main content. Add text overlays for key points.

Scene 4 (50-75s): supporting visual or chart. Show what you are saying.

Scene 5 (75-90s): avatar close-up for CTA. Direct eye contact, slower pacing.

Synthesia ships scene templates — use them as starting points, not finished structure.
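Writing the scene plan down as data makes it easy to spot gaps and overlong stretches before building anything in the editor. The sketch below encodes the five scenes listed above; the `Scene` class and `check_plan` function are hypothetical names for illustration, and the 15-second limit is taken from the "every 10-15 seconds" guideline.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start: int  # seconds from the start of the video
    end: int
    kind: str   # "avatar" or "visual"
    note: str


PLAN = [
    Scene(0, 10, "avatar", "intro with strong hook"),
    Scene(10, 25, "visual", "screen recording or product shot"),
    Scene(25, 50, "avatar", "main content with text overlays"),
    Scene(50, 75, "visual", "supporting visual or chart"),
    Scene(75, 90, "avatar", "close-up for CTA"),
]


def check_plan(scenes, max_scene_seconds=15):
    """Return a list of human-readable issues; an empty list means the plan passes."""
    issues = []
    # Consecutive scenes should meet exactly, with no gap or overlap.
    for previous, current in zip(scenes, scenes[1:]):
        if current.start != previous.end:
            issues.append(f"gap or overlap between {previous.end}s and {current.start}s")
    # No single scene should run longer than the cut interval.
    for scene in scenes:
        length = scene.end - scene.start
        if length > max_scene_seconds:
            issues.append(f"scene {scene.start}-{scene.end}s runs {length}s; add a cut or overlay change")
    return issues


for issue in check_plan(PLAN):
    print(issue)
```

Run against the plan above, the check flags scenes 3 and 4, which each run 25 seconds. That is a reminder that those scenes need something changing inside them, such as the text overlays or chart builds already suggested, to stay within the 10-15 second rhythm.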

Step 5

Add subtitles and review on mobile

85% of social video is watched muted. Subtitles are not optional.

Synthesia auto-generates subtitles. Click "Add subtitles" → review.

Review every line for misspelled product names, acronyms, brand terms. Auto-generation gets these wrong often.

Style: white text with black background or shadow. Readable on bright phone screens.

Preview on mobile dimensions (9:16 or 16:9 zoomed) before final render.

Reading speed: subtitles should match avatar pace. Reset timing if anything reads too fast or too slow.
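One way to make "too fast" concrete is to measure characters per second for each subtitle cue. This is a rough sketch: the cue format (start, end, text) is an assumption about how you export or copy the subtitles, the function names are hypothetical, and the 17 characters-per-second default is an assumed starting threshold to tune, not a Synthesia setting.

```python
def chars_per_second(text, start, end):
    """Reading speed of one cue; start and end are in seconds."""
    duration = end - start
    if duration <= 0:
        raise ValueError("cue must have a positive duration")
    return len(text) / duration


def flag_fast_cues(cues, max_cps=17.0):
    """Return the text of every cue that exceeds the reading-speed threshold."""
    return [
        text
        for (start, end, text) in cues
        if chars_per_second(text, start, end) > max_cps
    ]


cues = [
    (0.0, 2.0, "Welcome to the product tour."),
    (2.0, 3.0, "This line is far too long to read in one second."),
]
print(flag_fast_cues(cues))
```

Flagged cues are candidates for splitting across two cues or for a timing reset in the editor.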

Step 6

Render, review, iterate before publishing

The first render is rarely the final one. Plan for 2-3 iterations before shipping.

Hit Render. 1080p typically takes 5-15 minutes depending on length.

Watch the rendered output on the device your audience will use (phone for social, desktop for B2B).

Note timing of any "uncanny" moments — usually mouth shape on specific words. Adjust pacing or rephrase.

Check audio levels, music balance, intro/outro transitions.

Re-render. Most production-quality videos go through 2-3 render iterations before publication.

Common mistakes

What goes wrong (and how to avoid it)

Writing the script in the Synthesia editor
What goes wrong: The script becomes longer and less focused because the editor encourages adding text to fill scenes. The final video runs 2-3 minutes when 75 seconds would have been better.
How to avoid: Write the script in a doc first. Tight, conversational, read-aloud-tested. Then paste into Synthesia.

Picking an avatar that does not match the script tone
What goes wrong: The avatar feels uncanny. The viewer cannot articulate why but disengages. View-through rates drop 30-50%.
How to avoid: Preview 5-7 avatars with your script snippet before committing. Match warmth, formality, and energy.

No scene changes: 90 seconds of one avatar talking
What goes wrong: Retention drops sharply after 15 seconds. Average view time hits 30-40% even with a great script.
How to avoid: Cut to a supporting visual every 10-15 seconds. Even a static text slide breaks the monotony.

Skipping subtitles or shipping with auto-generated ones unedited
What goes wrong: Muted social viewers cannot follow. Brand terms appear misspelled. Embarrassment compounds.
How to avoid: Always review auto-subtitles line by line. Fix brand terms, style appropriately, time-check on mobile preview.

Rendering once and shipping
What goes wrong: The first render has timing issues, voice glitches, or audio balance problems that go unnoticed because you are too close to it. They are obvious to viewers.
How to avoid: Plan for 2-3 render iterations. Walk away between renders to see with fresh eyes.

Recap

What to take away

Write the script in a doc, not the editor. Read it aloud.
Match avatar tone to script tone. Preview multiple.
Voice and avatar should feel like the same person.
Scene structure: cut every 10-15 seconds for retention.
Subtitles always. Review every line. Plan for 2-3 render iterations.

Hand it off

Producing one Synthesia video is a project. Producing them weekly at brand-quality is a job. EverestX video specialists familiar with Synthesia run $400-1,200/mo for ongoing video production at $14-16/hr.

Frequently Asked Questions

How long does a Synthesia video really take?

First video: 2-3 hours including learning curve. By video #5: 60-90 minutes. Specialists can produce ongoing video in 45-60 minutes per piece for routine use cases.

Will viewers know it is an AI avatar?

Most viewers detect it within 5-10 seconds. The strategy is not to hide the AI but to make the AI version of the content better than a low-budget real-person version. When it is used well, viewers prefer high-quality AI video to bad live video.

Can I use Synthesia for sales outreach video?

Yes — see the sales outreach tutorial. With personalization (name, company, role overlays) and short scripts (45-60 seconds), Synthesia outreach can outperform live-recorded outreach for cold prospects.

How is Synthesia different from Loom?

Loom records you talking on camera. Synthesia generates an avatar speaking your script. Loom is faster and more personal. Synthesia scales and supports localization. Different tools for different jobs — many teams use both.

Related tutorials

How to set up your Synthesia account the right way

How to set up Synthesia brand kit so every video ships on-brand

When to hire an AI video specialist who knows Synthesia
