TTS Services Comparison — Amazon Polly & Acapela

This blog post compares two competing text-to-speech (TTS) services, Acapela and Amazon Polly. Many products and services can be improved, from a user-experience perspective, by having a human-like voice speak to the user. TTS is particularly useful in environments where a visual user interface is unhelpful or not possible, such as GPS voice navigation and digital assistants like Google Assistant and Amazon Echo. I’ll compare the two services on the three most important aspects of a good TTS service: speech quality, the number of voices and languages, and the cost of generating audio.

Speech Quality

Perhaps the most important quality a TTS service should have is correctly spoken, human-sounding audio. Acapela and Polly both offer correct and realistic audio, although this depends on which voice and language you choose. The British English and American English voices on both services sound very good, but Polly’s Australian English options sound noticeably more artificial and robotic than Acapela’s.

As far as pronunciation is concerned, both services occasionally mispronounce words but are correct the vast majority of the time. Even when mispronunciation occurs, most instances are not severe enough to inhibit understanding of the word in question or the sentence as a whole. Both services also let you specify custom pronunciations, which is useful for correcting errors that a voice makes consistently. This is especially handy for forcing the correct pronunciation of email addresses, abbreviations, and product or business names.

A flaw both services share when imitating human speech is a lack of appropriate pauses. Pauses can be forced by altering the text that the service generates speech from. However, many use cases involve dynamically generated text, which cannot be manually tweaked to sound just right.
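For dynamically generated text, one option is to add the pauses in code rather than by hand. The following is a minimal Python sketch, assuming the service accepts SSML input with `<break>` tags (Amazon Polly does; check Acapela’s documentation for its own pause markup). The function name `add_pauses` and the pause lengths are illustrative choices, not part of either service.

```python
import re
from xml.sax.saxutils import escape


def add_pauses(text: str, sentence_ms: int = 400, clause_ms: int = 150) -> str:
    """Wrap plain text in SSML, adding a <break> after punctuation.

    Hypothetical helper: the pause lengths are guesses to tune by ear.
    """
    # Split on whitespace that follows punctuation, keeping the punctuation
    # attached to the preceding chunk.
    parts = re.split(r"(?<=[.!?,;:])\s+", text.strip())
    out = []
    for i, part in enumerate(parts):
        # Escape &, < and > so dynamic text cannot break the SSML markup.
        out.append(escape(part))
        if i < len(parts) - 1:
            # Longer pause at the end of a sentence, shorter after a clause.
            ms = sentence_ms if part[-1] in ".!?" else clause_ms
            out.append(f'<break time="{ms}ms"/>')
    return "<speak>" + " ".join(out) + "</speak>"


print(add_pauses("Turn left, then stop. Arrived!"))
```

With Polly, the resulting string would be sent with the text type set to SSML instead of plain text. This is a blunt rule: an abbreviation such as “e.g.” followed by a space will also get a sentence-length pause, so real text may need a smarter splitter.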

Language and Voice Selection

At the time of writing, Acapela wins on the raw number of languages and voices available, featuring 34 languages and over 100 voices compared to Polly’s 24 languages and 47 voices. It is worth noting that although Acapela has more than double the number of voices, many of them are gimmicky and unsuitable for most use cases. The most common non-standard voice offered by Acapela, described as “childlike”, may be useful for children’s toys and games but not much else.

An important difference here is which languages are unique to each service:
Acapela uniquely offers Arabic, Catalan, Chinese, Czech, English (Scotland), Finnish, Greek, Korean, and Sami.

Polly, on the other hand, uniquely offers English (Welsh), Icelandic, Romanian, and Welsh. Acapela is the clear winner here, although Polly still has a couple of languages available that Acapela does not.

Costs

Comparing the two services by their pricing is completely one-sided. Amazon Polly offers “pay-as-you-go” pricing and a free tier. The free tier allows 5 million characters a month for the first 12 months of use. Amazon estimates 1 million characters to be roughly 23 hours of generated audio, so that is over 100 hours of audio a month (roughly 115) for the first year without paying anything, which is a very good deal. The pricing for any characters over that 5 million, or once the free first year has expired, is also quite cheap at $4.00 USD per 1 million characters.
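As a rough sketch of how those Polly numbers combine, the Python below estimates a monthly bill and the hours of audio for a given character count. It uses only the figures quoted above (5 million free characters a month in the first year, $4.00 per million after that, about 23 hours per million characters). The function names are hypothetical, and real bills may differ with voice type and any pricing changes since the time of writing.

```python
FREE_CHARS_PER_MONTH = 5_000_000   # free tier, first 12 months only
USD_PER_MILLION_CHARS = 4.00       # pay-as-you-go rate quoted above
HOURS_PER_MILLION_CHARS = 23       # Amazon's rough estimate


def polly_monthly_cost(chars: int, in_free_year: bool = True) -> float:
    """Estimated USD cost of synthesising `chars` characters in one month."""
    free = FREE_CHARS_PER_MONTH if in_free_year else 0
    billable = max(0, chars - free)
    return billable / 1_000_000 * USD_PER_MILLION_CHARS


def chars_to_hours(chars: int) -> float:
    """Approximate hours of generated audio for `chars` characters."""
    return chars / 1_000_000 * HOURS_PER_MILLION_CHARS


print(chars_to_hours(5_000_000))             # hours covered by the free tier
print(polly_monthly_cost(8_000_000))         # during the free first year
print(polly_monthly_cost(8_000_000, False))  # after the free year
```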

Acapela has two payment models: a bulk price scheme where you pay for a certain amount of generated audio, and a subscription-based model where you pay yearly. Both are charged in a unit Acapela calls a Voice-as-a-Service unit (VU), where one VU is 20 seconds of generated audio.

Acapela’s bulk pricing is more expensive and less convenient than Polly’s pay-as-you-go model. The minimum bulk purchase costs €1,500 for 10,000 VU, or approximately 55.6 hours of audio. Their subscription rates start at €1,800 a year for 36,000 VU, or 200 hours of audio. It is worth noting that both the bulk and subscription prices get cheaper per unit on larger plans, which will be a factor if you plan on generating a lot of audio.

To make a fair comparison of the costs, I’ll roughly convert the rates to the same unit of measurement: USD per hour of audio. These conversions give:

  • Amazon Polly: $0.17 per hour
  • Acapela Bulk Pricing: $30.29 per hour
  • Acapela Yearly Subscription: $10.10 per hour
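The conversion behind these figures can be sketched in a few lines of Python. The exchange rate is an assumption: the post does not state the one used, and roughly 1.122 USD per EUR is simply the value that reproduces the Acapela numbers above. The function names are also hypothetical.

```python
SECONDS_PER_VU = 20             # one Acapela VU is 20 seconds of audio
HOURS_PER_MILLION_CHARS = 23    # Amazon's rough estimate for Polly
USD_PER_EUR = 1.122             # ASSUMPTION: inferred from the figures above


def vu_to_hours(vu: int) -> float:
    """Hours of audio covered by a number of Acapela VUs."""
    return vu * SECONDS_PER_VU / 3600


def acapela_usd_per_hour(price_eur: float, vu: int) -> float:
    """USD per hour of audio for an Acapela plan priced in euros."""
    return price_eur * USD_PER_EUR / vu_to_hours(vu)


def polly_usd_per_hour(usd_per_million_chars: float = 4.00) -> float:
    """USD per hour of audio at Polly's pay-as-you-go rate."""
    return usd_per_million_chars / HOURS_PER_MILLION_CHARS


print(f"Polly:                ${polly_usd_per_hour():.2f}/hour")
print(f"Acapela bulk:         ${acapela_usd_per_hour(1500, 10_000):.2f}/hour")
print(f"Acapela subscription: ${acapela_usd_per_hour(1800, 36_000):.2f}/hour")
```

Changing `USD_PER_EUR` to the current rate shifts the Acapela figures slightly, but not by enough to change the overall picture.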

It’s quite plain to see here that Amazon Polly is much cheaper than anything Acapela offers. Although Acapela’s pricing improves with larger allowances, even those plans are not at all competitive with Polly’s pricing.

Conclusion

It’s hard to beat Amazon Polly’s prices: going by the per-hour figures above, Polly is well over fifty times cheaper than Acapela’s entry-level plans. The main reasons to choose Acapela are if certain voices are worth the premium price to you, or if you wish to offer TTS audio in one or more of the languages unique to it; otherwise, Polly is the clear winner for most uses.

Header image courtesy of Ruth Caron.


Thanks to Allan Jones and Shannon Pace for proofreading.