As an author of stories for language learners, it’s always important to me to create materials that are both fun and useful on a day-to-day basis. So perhaps it will not come as much of a surprise that many of these works are based on my own experiences (and frustrations) of learning languages myself.

A couple of weeks ago, for example, in an effort to bolster my Hebrew reading-comprehension skills, I was delving into an Ephraim Kishon story and really wished that there was a way to have the text read aloud to me while I was parsing it. As you may know, modern Hebrew is written without vowels which can make the pronunciation of unknown words really tricky sometimes. This paragraph would read something like this:

cpl f wks g, fr xmpl, n n ffrt t blstr my Hbrw rdng-cmprhnsn sklls, ws dlvng nt n phrm Kshn stry nd rlly wshd tht thr ws wy t hv th txt b rd ld t m whl ws prsng t. s y my knw, mdrn Hbrw s wrttn wtht vwls whch cn mk th prnnctn f nknwn wrds rlly trcky smtms. Ths prgrph wld rd lk ths:

Fun times, right?

The Bewildering World Of Talking Ebooks

Since I couldn’t force the publisher to integrate narration into the text, I decided to look into ways of creating something like this for my own learning materials, since I have both the text and the audio (each published separately) for many of my books. Ideally I wanted an ebook that would highlight each sentence as it was being read aloud and also let readers tap on individual sentences to hear them read.

Now, some of you might say: but Amazon already has something like that. No need to reinvent the wheel!

And it’s true. If you own the Kindle edition of a title and also have the Audible edition, Amazon will automagically bring them together as part of a feature called “WhisperSync for Voice”. Since most of my books are available on both Kindle and Audible, I thought it would be no big feat to just bring the two together.

Fiddlesticks!

Turns out that

a) Amazon doesn’t make this available for all titles,

b) the conditions for making it available are really opaque

Apparently the Kindle edition and the Audible edition need to be word-by-word duplicates, which makes sense since the app has to align the audio with the text but really is a bummer for books such as mine that contain extra text like vocabulary and exercises.

So Amazon and its Kindle devices & apps unfortunately were off the table.

Next, I started looking into the open EPUB standard, which literally everyone (except Amazon!) uses, and lo and behold – EPUB3 supports a feature called “Media Overlays”. I built a quick prototype (with the help of Alberto Pettarin’s excellent guide), tested it in Adobe Digital Editions on Windows and it was working!
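For the technically curious: a Media Overlay is essentially a small SMIL file that pairs element ids in the text with clips in the audio. Here is a minimal sketch in Python of what generating one could look like. The file names, ids and timings are made up for illustration, and a real EPUB additionally needs the overlay declared in its package document (manifest entries, durations), which is left out here.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"


def build_smil(text_file, audio_file, fragments):
    """Build a SMIL media overlay as a string.

    fragments is a list of (element_id, begin_seconds, end_seconds),
    one entry per phrase that should be highlighted and played.
    """
    ET.register_namespace("", SMIL_NS)  # serialize SMIL as the default namespace
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    for n, (elem_id, begin, end) in enumerate(fragments, start=1):
        # Each <par> ties one text fragment to one audio clip.
        par = ET.SubElement(body, f"{{{SMIL_NS}}}par", {"id": f"par{n:04d}"})
        ET.SubElement(par, f"{{{SMIL_NS}}}text", {"src": f"{text_file}#{elem_id}"})
        ET.SubElement(
            par,
            f"{{{SMIL_NS}}}audio",
            {
                "src": audio_file,
                "clipBegin": f"{begin:.3f}s",
                "clipEnd": f"{end:.3f}s",
            },
        )
    return ET.tostring(smil, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file names and timings, for illustration only.
    print(build_smil("chapter1.xhtml", "audio/chapter1.mp3",
                     [("f001", 0.0, 2.5), ("f002", 2.5, 5.125)]))
```

The reading app does the rest: when the audio reaches a clip, it highlights the element with the matching id, and when you tap that element, it jumps to the clip.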

Unfortunately, when I put the resulting EPUB on my iPad and tried to open it with iBooks, the narration was *poof* gone, non-existent. After a bit of research I found out that Apple apparently only supports narration for “fixed layout” EPUBs, not reflowable texts. Why, Apple?! The reflowable nature of ebooks is what allows for font-resizing and other customizations, whereas fixed layouts are like glorified PDFs, i.e. the number of words and their format on each page are set in stone.

Open Standards For An Open World

So two of the biggest ebook ecosystems on the planet, Kindle and iBooks, dropped out of the race before I had even really started. No biggie, right? But I liked the concept too much to just give up on it. I simply had to find a way to work around the giants and put this book into the hands of readers directly.

In other words, I had to come up with different user scenarios and find the solution that is most comfortable for casual readers. Adobe Digital Editions, for example, works with EPUB3 overlays on PC, but not on Android or iOS! In the end I found an excellent app called Menestrello, developed by Alberto Pettarin, which runs perfectly on iOS and Android and thus covers most mobile devices (see updates below). On Chrome there’s the excellent Readium extension, which covers most laptops and desktops.

However, I still wanted something simpler, something that people could just use immediately, without downloading or installing anything. That’s when I found that the same people behind the Readium extension (update: discontinued) also made an open-source web app for embedding EPUBs directly in your browser.

So that’s what I settled on in the end:

  • offering the EPUB as a direct download
  • embedding it in a “cloud reader” for immediate enjoyment

You can find out more about this project here and check out a quick demo here:

UPDATE 4/2020:

It’s been more than 2 years since I wrote this blog post. Here’s what has changed since then:

  • Menestrello for iOS is no longer available (update: the suggested alternative, Cloudshelf Reader, has since been discontinued as well; see here for alternatives)
  • Menestrello is no longer available on Google Play either (but you can still get the APK from the developer)
  • I’ve created two more TalkingBooks, for Ferien in Frankfurt and Karneval in Köln

Put simply, EPUB3 audio support has become even less commonplace in apps since 2017 (especially due to the loss of Menestrello). So while Amazon is continuing to loop more and more people into its proprietary WhisperSync for Voice program, the open source world in general and EPUB in particular are surprisingly lacking in innovation.

This seems like such a basic feature, and EPUB3 has been supporting it for years, at least technically. It’s easy to complain about Amazon, but why aren’t other platforms building open source alternatives? All the tools and foundations are there. Honestly, it’s mind-boggling. Personally, I just love this feature too much to give up on it. Especially for language learners this is so helpful.

What’s next?

Currently I’m looking into ways to develop my own simple Android and iOS apps based on Readium, but this is all going to take a lot more time. Fortunately, Readium Web has been going strong ever since and it just runs directly in the browser (for some reason on iOS it will not work in Chrome, only Safari).

For my latest TalkingBook I’ve finally used Tobi, “a free, open source, multimedia book production authoring tool”, to iron out some issues where phrases were chopped or not aligned perfectly. Tobi is really helpful for making quick adjustments to phrasings. And if you don’t have any audio narration yet for your text, you can record and edit directly within Tobi. It hasn’t replaced aeneas for me, but it has become invaluable for improving the results that aeneas provides.

UPDATE 4/2022:

It’s been another 2 years since the last update. I just released a new TalkingBook for Ahoi aus Hamburg and wanted to take a few minutes to talk about the state of EPUB3 audio overlays in the year 2022.

  • For this edition I relied solely on Tobi, skipping aeneas entirely, which made development much speedier (and less technical).
  • Since the demise of both Menestrello and Cloudshelf Reader, iOS support for these types of books seems to be getting slimmer and slimmer. The only two apps that I’ve found so far on iOS which support EPUB3 with audio overlays are Adobe Digital Editions (App Store link) and Kobo Books (App Store link). Kudos to Kobo for allowing sideloading of EPUB3 files with a full feature set.
  • Thorium is probably the best way to read and enjoy TalkingBooks on desktop at this point. Cross-platform compatibility (Linux, Mac, Windows), a beautiful clean UI and open-source.

I’ve been using Thorium extensively in this development cycle for proofing, finally replacing the (eternally buggy) Adobe Digital Editions (which for whatever weird reason severely degrades audio quality). In a perfect world, Thorium Reader would also exist on Android and iOS. EDRLab, the developer behind it, does offer R2 Reader for iOS, but for whatever reason it doesn’t seem to support audio (update: looks like it’s on the roadmap). Also, an integrated lookup/dictionary function would be awesome, but it doesn’t seem to be a priority at this point.

My own attempts at creating an iOS/Android app around Readium have not yielded anything substantial yet, but I’ll keep on looking into it as time allows. If you’re an iOS and/or Android developer with EPUB3 experience and would like to collaborate, just let me know.

UPDATE 1/2024:

Yet another two years (and two new TalkingBook editions, for Plötzlich in Palermo and Walzer in Wien) later, I wanted to give some quick updates on the world of EPUB3 with media overlays in 2024.

First of all, not much has changed in terms of app support, unfortunately. I’m continuing to maintain a list of apps that support this feature, but especially on iOS support is less than ideal. (If you’re aware of other apps, please let me know in the comments!)

Having said that, there are some interesting developments in this space, most notably Shane Friedman’s Storyteller project which provides a whole framework for creating and even reading (local server required) these types of books:

“Storyteller is a self-hosted platform for creating and reading ebooks with
synced narration. It’s made of three components: the API server, the web
interface, and the mobile apps. Together, these components allow you to take
audiobooks and ebooks that you already own and automatically synchronize them,
as well as read or listen to (or both!) the resulting synced books.”

Unlike other forced-alignment tools like aeneas or syncabook, Storyteller doesn’t just try to nail audio and text together “blindly” but uses Whisper (OpenAI’s speech recognition tool) for creating a transcript from the audio and matching it with the text using some nice fuzzy algorithms designed to ignore any non-essential parts of the text.
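To make the idea a bit more concrete, here is a minimal sketch of transcript-to-text matching using only Python’s standard library. This is not Storyteller’s actual algorithm, just an illustration of the general approach: the function names, the sample sentences and the word timings (the kind of data a speech recognizer might return) are all made up.

```python
import re
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase a word and strip punctuation so minor differences don't block a match."""
    return re.sub(r"\W+", "", word.lower())


def align(sentences, transcript):
    """Match book sentences against a timed transcript.

    sentences:  list of str (the book text, one entry per sentence)
    transcript: list of (word, start_seconds, end_seconds)
    Returns (sentence_index, start, end) for every sentence with at least
    one matched word; text that was never narrated gets no timing.
    """
    book_words = []  # (normalized word, index of the sentence it belongs to)
    for i, sentence in enumerate(sentences):
        for w in sentence.split():
            n = normalize(w)
            if n:
                book_words.append((n, i))
    heard = [normalize(w) for w, _, _ in transcript]

    matcher = SequenceMatcher(None, [w for w, _ in book_words], heard, autojunk=False)
    spans = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            sent = book_words[block.a + k][1]
            _, start, end = transcript[block.b + k]
            lo, hi = spans.get(sent, (start, end))
            spans[sent] = (min(lo, start), max(hi, end))
    return [(i, *spans[i]) for i in sorted(spans)]


if __name__ == "__main__":
    # Made-up sample: the middle line is extra text that is not in the audio.
    sentences = ["Guten Morgen!", "Vokabeln: der Kaffee", "Wie geht es dir?"]
    transcript = [("guten", 0.0, 0.4), ("morgen", 0.4, 0.9), ("wie", 1.2, 1.4),
                  ("geht", 1.4, 1.7), ("es", 1.7, 1.8), ("dir", 1.8, 2.2)]
    print(align(sentences, transcript))
```

In this toy example the vocabulary line simply ends up without a time span, which is exactly the behavior you want for books that contain more text than the narrator reads.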

Did I finally find a way to create TalkingBooks with a single click? Dropping in mp3 and epub in one side, taking out the finished EPUB3 with media-overlays on the other side?

Well, no …

Unfortunately, the base Whisper model used by Storyteller only supports English for now. I’ve tried modifying the Python scripts to use the large-v2 model instead, which technically should support other languages like German as well, but that either overloaded my system (large language models can become quite, well … large) or failed for other reasons that weren’t immediately clear.

So for now, I’m back to hand-crafting: running regexes to create segments in the EPUB, fine-tuning them with a didactic mindset to isolate syntactically and semantically relevant bits, then using Tobi to align the audio manually, phrase by phrase.
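As a rough illustration of the regex step, here is a minimal Python sketch that splits a paragraph into phrases at punctuation marks and wraps each one in a span with its own id, so that an audio overlay has something to point to. The pattern, the id scheme and the sample sentence are assumptions for this example, not the exact expressions I run, and the manual fine-tuning afterwards is still where most of the work happens.

```python
import re
from html import escape

# Split at the whitespace that follows sentence-ending punctuation
# or a comma, semicolon or colon (an assumed, deliberately simple rule).
BOUNDARY = re.compile(r"(?<=[.!?;:,])\s+")


def segment(paragraph, start=1):
    """Wrap each phrase of a paragraph in a <span> with a sequential id."""
    phrases = [p for p in BOUNDARY.split(paragraph.strip()) if p]
    spans = [
        f'<span id="f{n:04d}">{escape(p)}</span>'
        for n, p in enumerate(phrases, start=start)
    ]
    return "<p>" + " ".join(spans) + "</p>"


if __name__ == "__main__":
    # Made-up sample sentence, for illustration only.
    print(segment("Heute regnet es, aber morgen scheint die Sonne. Schön, oder?"))
```

A purely mechanical rule like this happily produces fragments such as a lone “Schön,”, which is why every segment still needs a human look before the audio gets aligned.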

As a small aside, while I wasn’t able to get Whisper to work with German in Storyteller, I found out that LingQ recently integrated Whisper into their platform as well, allowing you to create transcripts from audio and then matching these with your text (phrase by phrase). The results were quite passable, but still not accurate enough for my taste. For example, while it generally detected the phrases correctly, the audio was sometimes slightly misaligned, cutting words in half at the end/beginning of a phrase.

For example, in the following phrase the “et cetera” is split halfway, so when you click on the “An manchen” paragraph the audio narration starts with “etera …”.

This happens with many other phrases as well and has been an issue I encountered with almost all of these automated tools. If I end up manually correcting every second or even third or fifth phrase, I might as well do the whole thing manually from the start. The amount of time “saved” isn’t really worth the control given over to the algorithm in terms of phrase selection.

So while Whisper does seem to do a generally acceptable job, it’s just not accurate enough for my liking yet. Also, while playing around with Whisper in this Colab, I noticed that it can be extremely tricky to get the AI to consistently (!) segment phrases according to certain patterns. Either it selects too much, or too little, or varies its patterns, and I ended up spending a lot of time on what seems to have become many people’s favorite pastime over the last year or so: arguing with AI, rolling the dice over and over again in hopes of getting the desired result, almost there, but never quite.

As I wrote in my newsletter, while the promise of this technology is exciting in theory, all of this has left me with a renewed appreciation for how humans naturally and effortlessly parse language. We read. We listen. We speak. Things just make sense. It’s amazing, if you think about the gazillions of calculations that have to be run on a machine to emulate only a fraction of our innate ability for language. And then it’s still very hit-or-miss.

Obviously these tools are all rather new, and AI will probably (hopefully?) get better at providing more consistent results in the future. But for the time being, I think there really is no easy shortcut when it comes to creating these types of books.