Projektika tools Audio to Subtitles

Concept, IMHO

In general, my approach to transcribing and/or translating a film into subtitles is quite simple and slightly different from most sites. In my opinion, you don't need to upload the whole film to the server. Speech recognition systems (such as the widely used Whisper AI) work on audio, not video. So there is no need to clog up the network with unnecessary data.

The second issue is simply file size. Video files often take up several gigabytes, while compressed mp3 audio files are only tens of megabytes. Think about how much energy it takes to transfer a huge file compared to a small mp3. It doesn't matter for a single file, but enormous numbers of files are transferred around the world all the time. On such a large scale, this is a big saving. Admittedly, compressing a large file also takes time, but since I switched the compression to ffmpeg.wasm, a 1.5-hour mp4 video compresses to mp3 in about 120-170 seconds on my Mac, and sometimes faster; I don't know what that depends on. Instead of uploading about 1.4 GB, I upload about 70-90 MB. That makes a big difference to me because I don't have a very fast internet connection.
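As a rough sanity check of these numbers, here is a minimal sketch that estimates the size of an mp3. It assumes a constant bitrate of 128 kbps (the bitrate mentioned in the step-by-step description) and ignores headers and tags; the function name `mp3_size_bytes` is made up for illustration.

```python
def mp3_size_bytes(duration_seconds: float, bitrate_kbps: int = 128) -> int:
    """Approximate size of a constant-bitrate mp3, ignoring headers and tags."""
    bits = duration_seconds * bitrate_kbps * 1000  # kbps means 1000 bits per second
    return int(bits // 8)                          # 8 bits per byte


duration = 1.5 * 3600              # a 1.5-hour film, in seconds
audio = mp3_size_bytes(duration)   # about 86.4 million bytes
video = 1.4 * 1000**3              # the 1.4 GB video from the example above

print(f"mp3: {audio / 1e6:.1f} MB")          # mp3: 86.4 MB
print(f"saving: {1 - audio / video:.0%}")    # saving: 94%
```

The estimate lands inside the 70-90 MB range I see in practice, so the upload shrinks to a small fraction of the original video.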

Thirdly, if you want to edit the subtitles, the film you have on your hard drive does not need to be transferred over the network at all. There is no point in streaming from a server a video file that you already have on your hard drive. While you edit the subtitles, the file would be transferred dozens of times! I understand that convenience is important, but the paradox is that the faster the network, the more congested it becomes and the slower it runs.

So in my opinion, if you have a film file on your hard drive, just send the audio. The subtitles are matched to the film by timecodes, and those are the same for the audio and the video. And in this simple way you save our little planet :-)
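To illustrate why the audio alone is enough, here is a small sketch that turns timed text into subtitle cues. It assumes the subtitles are written in WebVTT format; the function names and the sample cues are hypothetical. The timestamps are just seconds from the start of the recording, so they fit the video as well as the audio extracted from it.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def vtt_document(cues) -> str:
    """Build a WebVTT file body from (start_seconds, end_seconds, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


# Hypothetical cues, as a speech recogniser might return them for an audio file
print(vtt_document([(1.0, 3.5, "Hello."), (3723.5, 3725.0, "Goodbye.")]))
```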

network, ecology, philosophy, my opinion

Step by step

You load a video file; the audio track is extracted from it, compressed to mp3 at 128 kbps and sent to the speech processing system. Then you have to wait a while: depending on the length of the video, this can take several minutes or more. You will be kept informed of the progress on screen. Generating the waveform takes a little extra time and is not strictly necessary. However, a couple of times I have sent a file that had no audio track, and the graph shows immediately that something is wrong.
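The waveform check can be sketched roughly like this. It is only an illustration, not the code this page runs: it assumes the audio has already been decoded into a list of samples between -1 and 1, and the names `waveform_peaks` and `looks_silent` as well as the threshold are made up.

```python
def waveform_peaks(samples, buckets):
    """Reduce samples (floats in [-1, 1]) to at most `buckets` peak values for drawing."""
    if not samples or buckets <= 0:
        return []
    size = -(-len(samples) // buckets)  # ceiling division, so every sample is covered
    return [
        max(abs(s) for s in samples[i:i + size])
        for i in range(0, len(samples), size)
    ]


def looks_silent(samples, threshold=0.01):
    """True if no part of the waveform rises above the (assumed) noise threshold."""
    return all(peak < threshold for peak in waveform_peaks(samples, 100))


print(waveform_peaks([0.0, 0.5, -0.8, 0.1], 2))   # [0.5, 0.8]
print(looks_silent([0.0] * 1000))                 # True
print(looks_silent([0.0] * 999 + [0.5]))          # False
```

A flat graph like the second case is exactly what a file without a usable audio track looks like.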

Unfortunately, a web browser can compress a file of at most 2 GB, or up to 4 GB in some browsers. In practice, the limit turns out to be slightly lower. So if you have a very large movie file, compress it to mp3 yourself. It will be faster and you will avoid unpleasant surprises.
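If you want to do the compression yourself, one possible way is the ffmpeg command-line tool. The sketch below only builds and runs the command; it assumes ffmpeg is installed, is on your PATH and was built with the libmp3lame encoder. The helper names `mp3_command` and `compress_to_mp3` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def mp3_command(video_path: str, bitrate_kbps: int = 128) -> list[str]:
    """Build an ffmpeg command that drops the video and encodes the audio as mp3."""
    output = str(Path(video_path).with_suffix(".mp3"))
    return [
        "ffmpeg",
        "-n",                         # never overwrite an existing output file
        "-i", video_path,             # input video
        "-vn",                        # no video stream in the output
        "-c:a", "libmp3lame",         # mp3 encoder (assumed to be available)
        "-b:a", f"{bitrate_kbps}k",   # audio bitrate, e.g. 128k
        output,
    ]


def compress_to_mp3(video_path: str) -> str:
    """Run ffmpeg on the given file and return the path of the mp3."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    command = mp3_command(video_path)
    subprocess.run(command, check=True)
    return command[-1]


print(" ".join(mp3_command("film.mp4")))
# ffmpeg -n -i film.mp4 -vn -c:a libmp3lame -b:a 128k film.mp3
```

The resulting mp3 can then be loaded here instead of the video file.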

When processing is finished, links to download the subtitles and translations will be displayed. A subtitle track will be added to the video player; just click Play Video. If you want to edit the subtitles, you can use the handy subtitle editor.

It is important to remember that speech recognition is not a perfect process. People sometimes hear something different from what someone has said. It is worth watching the whole film with subtitles and correcting any errors.

Most of the operations take place in the browser; only the recognised speech files are stored on the server. They are all available in the user panel, but you must log in BEFORE you upload a file to the speech recognition system.

I haven't had time to program elaborate safeguards. The variety of things that can go wrong is enormous, from a simple browser shutdown to server crashes. I will improve this gradually, but for now, unfortunately, if something goes wrong during processing, you will have to start the whole process over again. However, to be honest, this is extremely rare.

transcribe, translate, audio to text, speech recognition

General info

Speech recognition technology has come a long way in recent years, allowing for the conversion of spoken language into written text with impressive accuracy. This technology is particularly useful in a variety of settings, including dictation for writing documents, transcribing interviews and meetings, and enabling hands-free operation of devices like smartphones and smart speakers.

The process of speech recognition involves several steps. First, the speech is captured as an audio signal. The signal is then pre-processed to remove noise and other unwanted sounds. Next, the speech signal is segmented into smaller units, such as phonemes or words, which are then recognized using statistical models or deep learning algorithms.
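A very simplified sketch of the segmentation idea is shown below. It is not how Whisper or any particular recogniser works; it only separates louder stretches from silence by frame energy. The frame length of 400 samples (25 ms at an assumed 16 kHz sample rate), the threshold and the function names are all assumptions for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energies.append(sum(s * s for s in frame) / len(frame))
    return energies


def speech_segments(samples, frame_len=400, threshold=0.001):
    """Return (start, end) sample indices for runs of frames above the threshold."""
    segments = []
    start = None
    for index, energy in enumerate(frame_energies(samples, frame_len)):
        position = index * frame_len
        if energy >= threshold and start is None:
            start = position                      # a loud run begins
        elif energy < threshold and start is not None:
            segments.append((start, position))    # the run ended at this frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))    # the run lasts until the end
    return segments


signal = [0.0] * 800 + [0.5] * 800 + [0.0] * 400
print(speech_segments(signal))   # [(800, 1600)]
```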

One of the biggest challenges in speech recognition is dealing with natural language variability, including accents, dialects, and individual speaking styles. To address this, speech recognition systems often use acoustic models that have been trained on large datasets of diverse speakers.

Once speech has been recognized, it is typically converted to text using natural language processing (NLP) techniques. This involves analyzing the syntax and semantics of the language to produce a meaningful representation of the spoken words. NLP can also be used to correct errors that may have occurred during the speech recognition process.
