CapCut's Auto Captions is one of the best free transcription tools in any video editor, provided it is configured properly. Out of the box, it mangles brand names, drops punctuation, and over-styles the text. This tutorial walks through the workflow specialists actually use.
Who this is for
Anyone publishing short-form video with captions (which should be everyone: 85% of TikTok and Reels viewers watch on mute). If you're spending 15+ minutes per video on caption cleanup, this saves you most of that time.
What you'll need
Step 1
Auto Captions accuracy depends entirely on what language and which audio track CapCut uses. Set both explicitly.
Open your project in CapCut Desktop → click the audio track you want captioned to select it.
In the top toolbar, click Captions → Auto Captions.
In the dialog, set Language explicitly (don't trust auto-detect — it gets accents wrong 20% of the time). Set Sound source to the specific clip/track that has your voice.
If you have background music on a second track, mute it temporarily while Auto Captions runs (M shortcut). Music mixed in with the voice adds noise that lowers transcription accuracy.
Click Generate. Processing takes 10-60 seconds for a 60-second clip.
Step 2
Compile every brand-specific word the AI will mangle (product names, founder names, niche terms). You'll use this on every project.
Open a doc (Notion, Google Docs, anywhere). Title it "[Brand] CapCut Glossary."
List every word CapCut will likely misspell: product names ("EverestX" not "Everest X"), founder names, industry jargon ("DTC," "SaaS," "EBITDA"), acronyms that should stay uppercase.
Also include words you say but that are NOT brand-specific yet cause errors: "your" vs "you're," "its" vs "it's," "to" vs "too."
When you clean up captions later, run a manual Find & Replace through these. CapCut Desktop has a search bar in the captions panel (click the magnifying glass).
After 5-10 projects, you'll have a stable glossary that catches 90% of the consistent errors.
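If the glossary is kept as data rather than only as a doc, the same Find & Replace pass can also be run on caption text outside CapCut, for example on an exported subtitle file. The following is a minimal Python sketch under that assumption. The `GLOSSARY` entries and the `apply_glossary` helper are hypothetical names for illustration, not part of CapCut.

```python
import re

# Hypothetical glossary: what the transcription tends to produce -> correct form.
GLOSSARY = {
    "everest x": "EverestX",
    "dtc": "DTC",
    "saas": "SaaS",
    "ebitda": "EBITDA",
}

def apply_glossary(text: str, glossary: dict[str, str]) -> str:
    """Replace each glossary key (case-insensitive, whole words only)."""
    # Longest keys first so multi-word terms are handled before shorter ones.
    for wrong in sorted(glossary, key=len, reverse=True):
        correct = glossary[wrong]
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash surprises in the replacement text.
        text = pattern.sub(lambda _match, c=correct: c, text)
    return text

if __name__ == "__main__":
    print(apply_glossary("we built everest x for dtc and Saas brands", GLOSSARY))
```

A blind replace like this only suits unambiguous terms. Context-dependent pairs such as "your" vs "you're" still need a manual read-through.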
Step 3
Auto Captions produces a wall of text. Manually break lines, add punctuation, fix brand terms. ~10-15 min for a 60-sec clip.
In the captions panel, you'll see one line per detected pause. Each line is editable — click it.
Punctuation pass: Auto Captions almost never produces periods, commas, or question marks correctly. Add them. Punctuation makes captions readable and improves comprehension by 40%+.
Line-break pass: Each on-screen caption should be 2-4 words on a SINGLE line. Long lines wrap awkwardly on phones. Use the Enter key inside the captions panel to manually break.
Brand-term pass: Run your glossary through Find & Replace.
Tone pass: If you said "um" or "like" or had a stutter, delete those captions entirely (don't just delete the words — delete the caption row so the timing snaps to the next phrase).
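The 2-4 word rule from the line-break pass is easy to check or pre-compute outside CapCut before you press Enter in the captions panel. The sketch below is a naive Python illustration, assuming plain caption text as input. The `split_caption` function is a hypothetical helper, and it splits purely by word count, so it ignores phrase boundaries that a human editor would respect.

```python
def split_caption(text: str, max_words: int = 4, min_words: int = 2) -> list[str]:
    """Split caption text into chunks of at most max_words words."""
    words = text.split()
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    # Avoid a dangling short final chunk by borrowing words from the previous one.
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        needed = min_words - len(chunks[-1])
        chunks[-1] = chunks[-2][-needed:] + chunks[-1]
        chunks[-2] = chunks[-2][:-needed]
    return [" ".join(chunk) for chunk in chunks]

if __name__ == "__main__":
    for line in split_caption("this is how you fix your captions fast today"):
        print(line)
```

Treat the output as a starting point: move a break by a word or two when it would otherwise separate words that belong together.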
Step 4
TikTok captions, Reels captions, and Shorts captions each have a 'native' look. Match it — it boosts watch time.
TikTok native look: bold white sans-serif, black drop shadow, centered, 1-3 words at a time, positioned center-screen or slightly above center.
Reels native look: bold white sans-serif, no shadow (just a hard outline), slightly smaller than TikTok, positioned in the lower third.
Shorts native look: similar to TikTok but slightly more conservative — fewer effects, smaller text.
In CapCut, click any caption → Style panel → set Font (use a system font like Inter or Helvetica — custom fonts may not render correctly in some exports). Set Color: White, Stroke: Black 2-4 px, Background: None or 50% black.
Important: select ALL captions before changing style. Otherwise you'll only style the active one and the rest will be CapCut's default purple-pink gradient (which screams 'not edited').
Step 5
Auto Captions usually aligns within 0.5 seconds. For comedic timing, punchlines, or CTAs, you need 0.1-second precision.
Click a caption in the timeline (not the panel). Drag the edges to extend or shorten when it appears/disappears.
For punchlines: the caption should appear EXACTLY when the word is spoken. Even 200ms late kills the joke.
For CTAs ("link in bio," "follow for more"): hold the caption on screen 0.5-1 second longer than the speech — viewers need time to read.
For B-roll voiceover: caption timing is less critical, but ensure no caption overlaps two B-roll clips (it looks broken on the cut).
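The CTA hold rule above can be expressed as a small timing adjustment. This is a sketch in Python on caption data held outside CapCut, not a CapCut feature. The `Caption` class, the `CTA_PHRASES` list, and the 750 ms default (a value inside the 0.5-1 second range) are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Caption:
    start_ms: int
    end_ms: int
    text: str

# Hypothetical phrase list; extend it with your own calls to action.
CTA_PHRASES = ("link in bio", "follow for more")

def hold_ctas(captions: list[Caption], hold_ms: int = 750) -> list[Caption]:
    """Extend CTA captions by hold_ms without running into the next caption."""
    adjusted = []
    for i, cap in enumerate(captions):
        end = cap.end_ms
        if any(phrase in cap.text.lower() for phrase in CTA_PHRASES):
            end += hold_ms
            if i + 1 < len(captions):
                # Stop at the next caption's start, but never shorten the original.
                end = max(cap.end_ms, min(end, captions[i + 1].start_ms))
        adjusted.append(Caption(cap.start_ms, end, cap.text))
    return adjusted

if __name__ == "__main__":
    demo = [
        Caption(0, 1000, "Link in bio"),
        Caption(1500, 2500, "see you"),
        Caption(3000, 4000, "follow for more"),
    ]
    for cap in hold_ctas(demo):
        print(cap)
```

Capping the hold at the next caption's start keeps two captions from showing at once, which matters more than hitting the full hold time.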
Step 6
For TikTok/Reels/Shorts, ALWAYS burn captions into the video. Don't rely on the platform's auto-captions — they're not what you edited.
In CapCut Desktop, captions are part of the video timeline by default. They will burn in on export — no separate setting needed.
Verify by previewing the export at full resolution before finalizing.
If you ever want a separate .srt or .vtt file (for YouTube or a podcast platform): in CapCut, go to Captions panel → ... menu → Export Subtitles → choose SRT format.
Don't post a video to TikTok and toggle ON the platform's auto-captions if you already burned in your own. You'll get two sets of captions, one of which is wrong.
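If you do export an SRT file, it is plain text: each cue is an index line, a timing line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, and one or more text lines, with a blank line between cues. The Python sketch below parses that layout and flags cues longer than four words. The `SAMPLE` text is illustrative rather than real CapCut output, and `parse_srt` and `too_long` are hypothetical helper names. It assumes well-formed timing lines and will raise an error on malformed ones.

```python
import re

TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_ms(stamp: str) -> int:
    """Convert an SRT timestamp such as 00:00:01,200 to milliseconds."""
    h, m, s, ms = map(int, TIME.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def parse_srt(srt_text: str) -> list[tuple[int, int, str]]:
    """Return (start_ms, end_ms, text) for each cue in the SRT text."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip cues that have no text
        start, end = lines[1].split("-->")
        cues.append((to_ms(start), to_ms(end), " ".join(lines[2:])))
    return cues

def too_long(cues: list[tuple[int, int, str]], max_words: int = 4) -> list[str]:
    """List the text of every cue that exceeds max_words words."""
    return [text for _start, _end, text in cues if len(text.split()) > max_words]

SAMPLE = """1
00:00:00,000 --> 00:00:01,200
Stop scrolling.

2
00:00:01,200 --> 00:00:03,500
This one line is far too long
"""

if __name__ == "__main__":
    cues = parse_srt(SAMPLE)
    print(cues)
    print(too_long(cues))
```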
Common mistakes
Trusting auto-detect for language and audio source
What goes wrong: Accuracy drops 20-30%. You spend the cleanup time you "saved" by skipping the setup step. Worse, you train the team to think "auto-captions are bad" when the real issue is configuration.
How to avoid: Always explicitly select language (English-US, English-UK, Spanish, etc.) and audio source before clicking Generate.
Skipping the punctuation pass
What goes wrong: Captions without periods or commas read like a wall of words. Viewers swipe faster. Completion rate drops 5-15%. The algorithm sees worse retention and surfaces the video less.
How to avoid: Spend 5-8 minutes per 60-second clip adding punctuation. The retention lift compounds across every video you publish.
Using long, wrap-prone caption lines
What goes wrong: Two-line captions take up too much screen, cover the visual, and look broken on smaller phones. Watch time drops because viewers literally can't see the content.
How to avoid: Break every caption to 2-4 words MAX per line. Use the Enter key inside the captions panel to control breaks manually.
Leaving CapCut's default font/animation on captions
What goes wrong: Your video screams 'unedited CapCut export.' Brand looks low-effort, viewers swipe faster, the algorithm picks up on it. Some brands have seen 20-40% CTR drops from this alone.
How to avoid: Always restyle to a clean white-on-black-shadow or white-with-stroke look. Keep captions static, or use a subtle Type animation only.
Not building a brand glossary
What goes wrong: Every project re-introduces the same brand-term errors. Over 6 months and 100+ videos, that's 15-20 hours of preventable rework. You also occasionally publish videos with misspelled product names.
How to avoid: Spend 30 minutes building a glossary doc with every term CapCut mangles. Reuse on every project. Update quarterly.
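The glossary figures above imply a quick payback, which the short Python sketch below works out. It takes the tutorial's own numbers (15-20 hours of rework over 100 videos, 30 minutes of setup) and makes one simplifying assumption: that the glossary removes all of that rework. The function names are hypothetical.

```python
import math

def rework_minutes_per_video(total_hours: float, videos: int) -> float:
    """Average rework per video, in minutes."""
    return total_hours * 60 / videos

def break_even_videos(setup_min: float, saved_per_video_min: float) -> int:
    """Number of videos after which the setup time has been recovered."""
    return math.ceil(setup_min / saved_per_video_min)

if __name__ == "__main__":
    low = rework_minutes_per_video(15, 100)   # 9.0 minutes per video
    high = rework_minutes_per_video(20, 100)  # 12.0 minutes per video
    print(low, high)
    # 30 minutes of setup pays back after 3 to 4 videos under this assumption.
    print(break_even_videos(30, high), break_even_videos(30, low))
```

If the glossary only catches part of the errors, the break-even point moves out proportionally, but it stays far below 100 videos.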
Double-captioning (your captions + the platform's)
What goes wrong: You post your edited video to TikTok with auto-captions toggled on. Both layers display, and your edited captions end up half-covered by TikTok's auto-generated, lower-quality ones.
How to avoid: When uploading to TikTok/Reels/Shorts, toggle OFF auto-captions. Your burned-in captions are what should display.
Recap
Done — what's next
How to set up CapCut on mobile and desktop (and which one to use)
Hand it off
Caption cleanup is the most time-consuming part of short-form editing — 15-25 min per video for marketers, 5 min for a trained editor. If you're posting 20+ videos a month, the math is clear. EverestX short-form editors handle the entire pipeline (capture handoff → edit → caption → export → upload) from $14-16/hr.
CapCut is ~90-95% accurate for clean English audio — comparable to Descript and Otter for English. It drops 5-10 points for accented English, fast speech, or noisy background. For non-English languages, Descript is usually 2-5 points more accurate, but CapCut is free and integrated.
CapCut can also translate captions: after generating Auto Captions, click the ... menu in the captions panel and choose Translate. CapCut supports 20+ languages. Quality is solid for major languages (Spanish, Portuguese, French, German) but degrades for less common pairs. Always have a native speaker review before publishing.
Blurry captions after export have three usual causes: (1) you exported at lower-than-source resolution (1080p source → 720p export blurs text), (2) the bitrate is too low (raise it to 8 Mbps for text-heavy short-form), (3) you applied a Gaussian blur or motion blur effect that affects the caption layer. Check each.
Use animated captions sparingly. The word-by-word color highlight works for high-energy CTAs and is well-suited to TikTok. It looks tacky on B2B / SaaS / professional services content. Match the animation style to your brand voice: if you'd never see a word-by-word highlight on a McKinsey video, skip it for your brand too.
You can do caption cleanup on CapCut mobile, but it's painful past 30 seconds of content. Mobile's keyboard makes precision edits slow. Capture and rough-cut on mobile if needed, but do caption cleanup on desktop. You'll save 15-20 min per video.