Video and audio to text transcription – make or buy?

This article discusses the pros and cons of buying a transcription service versus building your own. It compares different services and looks at features such as transcription of English, speaker recognition, export of transcripts with timecodes, and recognition of key phrases or topics. It also looks at pricing, accuracy, and extras such as human transcription. It concludes that if you have enough internal IT resources, you can build a tool of your own, but it will not come for free and you may have to close gaps with regard to certain requirements.

Recently, we received a request to assist with automated video transcription, which involves generating text from a video’s audio stream. As is often the case, we faced the decision of whether to buy or to build it ourselves. We could purchase one of the many transcription services available online, or we could code it ourselves using existing libraries or boilerplate code. It is not an easy question to answer: using an online transcription service that bills per minute to transcribe a large amount of video material can be quite costly. On the other hand, if we build it ourselves, we need some technical knowledge and may not achieve the same quality as services that specialize in this. To get a more definitive answer, I decided to try both approaches.

First of all, our main requirements:

  • Transcription of English (audio, or better yet video)
  • Decent transcription quality for English with accents and dialects
  • Recognition of speakers
  • Export of transcripts with timecodes
  • Recognition of key phrases or topics
  • Improve transcription quality with your own glossary / terms
  • Improve transcription quality by feeding back manually corrected transcripts
  • (Online) editor for your transcript

Furthermore, I have looked into the following aspects of existing/commercial transcription services:

  • Whether a trial is available
  • If manual/human transcription is offered
  • Considering data privacy and GDPR: an offline or EU-based data processing option

Buy: online transcription services

A small disclaimer on my approach here: I did not do full-blown research on all the transcription tools that exist out there. Rather, I started off by googling “online transcription” and “video transcription”, and I looked into the popular qualitative text analysis tool “MaxQDA”, which lists certain transcription tools.

Name | Website link
Transcription tools I analysed

My comparison

Category | Auto / manual transcription | | | Auto transcription | | |
Transcription of English video | yes | yes | yes | yes | yes | yes | yes
Export of transcripts with timecodes | yes | yes | yes | yes | yes | yes | yes
Decent transcription quality for English with accents and dialects | ? | ? | yes | ? | ? | ? | ?
Recognition of speakers | yes | yes | yes | ? | yes | yes | yes
An offline or EU based data processing option | yes | IRL? | yes (Amazon EU) | Azerbaijan? | US | US, but on-prem available | US
Own glossary / terms | no | yes | Extra service | ? | no | yes | yes
Feeding back manually corrected transcripts | ? | ? | ? | ? | ? | ? | ?
Recognition of key phrases or topics | no | no | Extra service | ? | no | ? | ?
(Online) editor for your transcript | no | yes | yes | ? | yes | yes | yes

As you can see, there are several question marks where I could not verify whether a requirement can be fulfilled; this would require further analysis. For example, only one service claims to be able to recognize dialects. The others make no such claim, which does not necessarily mean they cannot do it as well. In the end, this has to be benchmarked against some samples, which I have not done yet either.
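
Such a benchmark could be as simple as comparing a service's output with a manually corrected reference transcript and computing the word error rate (WER). The following is a minimal sketch in Python, not a finished benchmark; the function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing in hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sample pair: one of five words is wrong.
print(word_error_rate("the demand is very high", "the demand is fairy high"))  # 0.2
```

A WER of 0.2 corresponds to roughly 80% of the words being right, so the same function can be used to compare several services on the same sample.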

Feeding manually corrected transcripts back to improve the language model would also be a logical feature for such an online service. It is, however, not obvious whether they do that. It is a tricky feature: on the one hand for data privacy reasons, and on the other hand because it may “deteriorate” the text recognition models, making them worse rather than better. To clarify this, you would have to ask the providers whether they really do this on a technical level, and you would have to test over a period of time whether the transcription quality really improves.


Next we will look at the pricing. All transcription service pricing models come down to a per-minute price. However, some services require a subscription, while others merely offer subscriptions as a way to save money.

Pricing automatic transcription
CHF per minute min | 0.04 | 0.21 | 0.20 | 0.10 | 0.23 | 0.06 | 0.08
CHF per minute max | 0.33 | 0.21 | 0.33 | 0.20 | 0.23 | 0.23 | 0.17
Pricing manual transcription (English)
CHF per minute min | 2.20 | 1.77 | 3.12 | | | |
CHF per minute max | 6.60 | | | | | |
Subscription/packages required | yes | no | optional | yes | no | optional |
Package size and monthly price
Small | 5 h / 103 CHF | | 5 h / 62 CHF yearly, 5 h / 78 CHF monthly | 1.5 h / 20 CHF | | 2 h / 22 CHF | $5 / hour, plus $22 per user/month
Mid 100h | ~100 h / 519 CHF | | no | 100 h / 519 CHF | | 30 h / 93.18 CHF | no
Large | 1000 h / 2600 CHF | | no | no | | no | no
Package size and minute price
Small per minute price | 0.343 | | 0.207 | 0.222 | | 0.183 | 0.087
Mid per minute price | 0.087 | | no | 0.087 | | 0.052 | no
Large per minute price | 0.043 | | no | no | | no | no
Transcription service pricing

These transcription services claim an accuracy of around 85%. While I have not yet tested this in depth, I believe they can achieve this for clear English audio/video material. With other languages, or with accents and dialects, that number will probably drop. Still, these online services have their advantages: you basically drag and drop your video onto the website, and after a fairly short time it returns your transcript. In addition, because they run online and some offer an online editor for your transcripts, they could in principle improve their natural language processing models continuously (I did not check whether they actually do this).

All this comes at a cost of between 0.04 and 0.33 Swiss Francs (around 0.04 to 0.36 US dollars) per minute, depending on the subscription / package you choose. A 1-hour interview transcript would therefore cost you between roughly 2.40 CHF and 19.80 CHF, a price difference of more than 17 CHF per transcription hour.
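
To make the arithmetic behind these numbers explicit, here is a small Python sketch. It only uses prices quoted in this article; the function names are mine and purely illustrative.

```python
def per_minute_price(package_hours: float, package_price_chf: float) -> float:
    """Effective per-minute price of a package, assuming it is fully used."""
    return package_price_chf / (package_hours * 60)


def job_cost(minutes: float, price_per_minute_chf: float) -> float:
    """Cost of transcribing a recording of the given length."""
    return minutes * price_per_minute_chf


# Package examples from the pricing table: 5 h for 103 CHF, 1000 h for 2600 CHF.
print(f"{per_minute_price(5, 103):.3f}")      # 0.343
print(f"{per_minute_price(1000, 2600):.3f}")  # 0.043

# One-hour interview at the cheapest and the most expensive per-minute price.
print(f"{job_cost(60, 0.04):.2f}")  # 2.40
print(f"{job_cost(60, 0.33):.2f}")  # 19.80
```

Note that the package price only translates into the low per-minute price if you actually use up all the hours in the package.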

An interesting bonus is the human transcription service, which some of these services offer alongside the automatic transcription. You can save money by using automatic transcription for less important jobs, or when you have enough capacity of your own to fine-tune the transcripts. If you need clean transcripts, want them fast, and want to save internal resources, you can switch to human transcription on the same platform.

Next I will look into the option of building your own transcription service, provided you have enough internal (IT) resources. This is not as difficult as one might imagine.

Make: your own transcription service

I wouldn’t be a software engineer if I did not at least consider a custom-coded solution for our requirements. Luckily, there are tons of free and open source libraries out there to solve our problems.

One such library, or rather a boilerplate solution that combines various existing open source libraries, is vid2cleantext.
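
Before any speech recognition can happen, the audio track has to be pulled out of the video. The sketch below shows only that first step, under the assumption that ffmpeg is installed and on the PATH. The function names and the choice of a 16 kHz mono WAV file are illustrative assumptions and are not taken from vid2cleantext.

```python
import subprocess
from pathlib import Path


def build_extract_command(video: Path, wav: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg call that writes the audio track as a mono WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video),         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # audio sample rate in Hz
        str(wav),
    ]


def extract_audio(video: Path) -> Path:
    """Run ffmpeg and return the path of the extracted WAV file."""
    wav = video.with_suffix(".wav")
    subprocess.run(build_extract_command(video, wav), check=True)
    return wav
```

Calling `extract_audio(Path("interview.mp4"))` (a hypothetical file name) would write `interview.wav` next to the video, which can then be passed to whatever speech recognition model you choose.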

While the default video with one of John F. Kennedy’s speeches worked quite well, I was wondering whether it also works well with other examples where the English is maybe less “Oxfordian”. So I picked the following video:


You may say that YouTube already provides subtitles for this video. Those, however, are manually curated subtitles (which are indeed accurate). If you enable the automatically generated YouTube subtitles instead (by clicking on CC), you will see that our upcoming self-generated transcript is not that far off Google’s sophisticated engine. Additionally, our transcript can be generated offline, without sending data to Big Brother (imagine you are conducting research on sensitive (geo-)political or classified corporate topics ;-).

Our result:

Neoa me well indeed i think its well known that we h’ve got the biggest epidemic in the whole world. With about seven million people infected and we have to put as many as possible on a series in order to get them virantly suppressed so that the virus is no longer transmitted so for that reason are running the biggest e can. Retroviral programs on earth and so we had to do everything in our power to make sure that the medicines are affordable we studied in twenty ten massively reduced the price of a year woes by fifty six per cent. Then we went over to twenty twelve when we introduced the f diz de fixtos combination at that time it was costing three hundred and forty one per per cent per mile after the fixdos combination it cost eighty nine. Person females we are now on the verge starting from july dinner to introduce a new fix to new fixto dose combination with the lutegrava and its going to go down. Seventy five it ran pekes name a game i a a e maaoa o oh. Yes it has maketly done so by increasing the pool of drugs that are available so that introducing accessibility but also competition because as you know convention will knock the prices down so they help. Us introduce new drugs into the market and and we believe for instance that the without the medicine patent put would not have been able to introduce some of the changes that we’ve done like this donitacarval I’ve just mentioned. The medicine potentially did a very good thing by negotiating with originator companies to issue licenses for those who produce medicine at an affordable price and make them available later to the population and thus export. A exhibit boon all the challenges is that the program is huge but the biggest challenge her well name the program is huge and sometimes. Formasydical companies cannot cope with the supply you know the demand is too high and the supply sometimes becomes wanting but the biggest problem are faced with is loss to follow up because we have realized that we. 
We are putting six hundred thousand people per annum on wards and after six months twenty per cent of them get lost to treatment which is quite problematic but have studied a new program to deal with that in February only we’re a. Able to trace twenty nine thousand of those who are lost to treatment and who are going on doing that but that’s that’s really the biggest challenge thatthe big challenge were having the second challenge is that if you look at the south african air program on nerves. Two thousand and four we had four hundred thousand people on a drive to say we have got four point five million it means it grew more than tenfold within this short period of time and and the their health web force. That time the health facilities could not have increased at that level so the result is long waiting times that means you’ve got more and more people looking for health care and it costs long waiting times irritating people it also can. Tributes to drop out of treatment where people get discouraged for going to eat for a long time when you add to that the explosion of a noncommunicable disease which i believe years away of tybitis high blood pressure can. Which we share with most of the world so the overall result is that we are overcrowded in our hospital the demand is very high and that can be very tricky young women between the ages of fifteen and twenty four are found to be particularly. Vulnerable while new infections are going down in other age groups they are unfortunately going up in this age group and that the age group of people are about to be adults who are going to be very sexually active wehave already started actually to be as. Usually active and it’s worrying us that’s why in south africa we’ve established a special program called the conkers which specifically date has for this aged group young women between the ages of fifteen and twenty four years. 
Is quite surprising these are some of the challenges that were faced with clearly quite surprising that technologies so advanced but and up to this era we still have a situation where we struggle with policy formalities because attention. Was never paid to them a situation where you still have to break a toilet into pieces in order to cater for a three month old baby owl will be very happy if the medicine patient pool can trail some of their guns in that direction of pie. Formulations for both and evy fats and to be no ooaooan absolutely absolutely to start with sensor drugs are devilish. Ly expensive and and i’ve just told this an explosion of cancer in fact or saying i suspect i’s going to be our new age heavy sort of intents of skill and demand and lots of social movements are going to be formed down town. Aid staying too much because they consider the prices so if the m p p and the medicine patient pool move into that area that’ll be greatly greatly helpful but in areas like t but we still have drugs like pitaguulie. Which which have proven to be extremely effective in the treatment of drugs negative the litaguline and the laminate if medicine needed pols can move in that direction that will be very helpful but you also have. Expensive diabetic medication diagnosis also me increase and dose of the areas to be medication and ocansar medication and diabetes medication and other areas by desire if you. Moving that it’ll bring a huge change and a huge relief to be is a a ban back o.

As a bonus, we get the following list of key topics out of this tool:

Rank | Key phrase
0 | hundred thousand people
1 | medicine patient pool
2 | long waiting times
3 | biggest challenge thatthe
4 | times irritating people
5 | group young women
6 | waiting times irritating
7 | costs long waiting
8 | million people infected
9 | patient pool move
10 | age heavy sort
11 | aged group young
12 | fixto dose combination
13 | expensive diabetic medication
14 | challenge thatthe big
15 | thatthe big challenge
16 | african air program
17 | special program called
18 | twenty ten massively
19 | greatly greatly helpful
20 | diabetic medication diagnosis
21 | ooaooan absolutely absolutely
22 | tybitis high blood
23 | high blood pressure
24 | health web force

Not so bad, is it? It is only around 70% accurate, but that is still roughly 70% less work you have to do when transcribing the text.
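
The key phrases in the list are all three-word sequences without filler words. I do not know which algorithm the tool uses internally, so the following Python sketch is only a simplified stand-in that shows the general idea: count three-word sequences that contain no stopwords. The stopword list and the function name are illustrative assumptions.

```python
import re
from collections import Counter

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on",
             "so", "that", "the", "to", "we", "with"}


def key_phrases(text: str, n: int = 3, top: int = 5) -> list[tuple[str, int]]:
    """Return the most frequent n-word sequences that contain no stopword."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        if not any(word in STOPWORDS for word in gram):
            counts[" ".join(gram)] += 1
    return counts.most_common(top)


# Made-up sample sentence, not taken from the transcript.
sample = "long waiting times irritate people and long waiting times cost money"
print(key_phrases(sample, top=1))  # [('long waiting times', 2)]
```

Because such an approach works on the raw transcript, recognition errors end up in the key phrases too, which is why entries like “biggest challenge thatthe” appear in the list.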

Now, what is missing with regard to our requirements?

  • Timecodes in the text
  • Recognition of speakers
  • Improving your transcription with own glossary / terms or feeding back your manually corrected transcripts
  • (Online) editor for your transcript
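
To give an idea of what closing the first gap involves: if the speech recognition step can be made to return segments with start and end times (an assumption, since the output shown here contains none), writing them out with timecodes is the easy part. A minimal Python sketch that formats such segments as SubRip (SRT) subtitles, with illustrative function names and sample data:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm as used in SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Turn (start, end, text) segments into the text of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical segments with start and end times in seconds.
print(to_srt([(0.0, 2.5, "Hello"), (2.5, 4.0, "World")]))
```

The harder part is getting reliable timestamps out of the recognition model in the first place, and speaker recognition is a separate problem altogether.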

To close these gaps you would need some Python and probably NLP-savvy experts, who can cost a lot. So there might be a break-even point when you have a lot of transcription work, but it won’t come for free. If you can live with those few unfulfilled requirements, you can have a cheap video-to-text transcription tool.
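
As an illustration of how a glossary could be applied after transcription, the following Python sketch replaces words that closely resemble a glossary term, using difflib from the standard library. This is a simple post-processing idea, not how commercial services implement custom vocabularies. The function name, the cutoff value, and the glossary term (assuming the speaker in the video meant “dolutegravir”) are assumptions for illustration.

```python
import difflib


def apply_glossary(text: str, glossary: list[str], cutoff: float = 0.7) -> str:
    """Replace each word that closely resembles a glossary term with that term."""
    lookup = {term.lower(): term for term in glossary}
    corrected = []
    for word in text.split():
        match = difflib.get_close_matches(word.lower(), list(lookup), n=1, cutoff=cutoff)
        corrected.append(lookup[match[0]] if match else word)
    return " ".join(corrected)


print(apply_glossary("combination with lutegrava", ["dolutegravir"]))
# combination with dolutegravir
```

A cutoff that is too low will start replacing ordinary words, so the value has to be tuned on real transcripts, and punctuation attached to words would need extra handling.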


Automated video transcription to readable English text is not magic these days. You may buy an out-of-the-box solution (and pay for transcription per minute) or, if you are a bit tech savvy, you can even implement it on your own. It is worth looking at your requirements, such as the languages to support, the quality of the transcripts, and additional features like improving the text recognition with your own glossary and transcriptions or the ability to export your transcripts into other tools.

Feel free to comment on this article if you have any feedback or approach me directly via our contact form, if you need help implementing your own AI or natural language processing solution.


Big thanks to Jonathan Lehner, who supported us with the custom coded transcription solution.

Headline Photo by George Milton.

