Bash :: Dual Language Subtitles
I wrote a quick script to aid in learning a foreign language.
I noticed that my obstacles with learning another language were parsing spoken words, pronunciation, and sentence syntax rather than vocabulary and conjugation.
When thinking on how best to tackle these issues I thought of film. If I could have both the spoken language and my native language subtitles on the screen I could use the spoken language subtitles as reference to parse the syllables I am hearing to their individual words, and then use my native language as reference for new vocabulary.
I often watch movies multiple times purely for enjoyment's sake so watching a film like this adds the extra element of learning.
After a few viewings, I begin just jumping from scene to scene, and recently I have been choosing one character from a dialogue and speaking their parts in response as if I were being spoken to.
All this, and I'm watching one of my favourite films.
Anyway, I'm digging it. Thought I'd share it with you all...
I haven't tried it yet but it looks like a great idea, well done.
Exactly what I was looking for. The funniest part is that I didn't expect to find a solution for Linux, and had resigned myself to running some win32 software in a VM.
I have exactly the same problem as you: if I don't read the English subtitles (in the case of an English-language movie), I can't parse the words.
And sometimes there are words I don't know, so having my native language (French) right beside them is perfect.
BTW, it worked like a charm.
pierrepoulpe, wow, thanks for the support. I am glad you found this and are putting it to good use.
Out of curiosity, how did you come across this post? If you came from a search, what was your wording, so I can make the post more search friendly?
I think I googled "dual language subtitle" or "subtitles dual language", something like that, which is almost the title of the thread...
It's the typical hard-to-find thing: you don't know which keywords a potential author may have used.
I'm finding your post highly confusing.
First of all, you never explained exactly what the script is designed to do. I had to try it out for myself to discover what it does (it appears to simply combine two srt-format subtitle files into one that displays both languages, BTW).
I also see some coding errors and other weak scripting points. This, for example:
I'd like to try my hand at making modifications and fixes to the script, but since you failed to include any comment lines, I can't quite figure out what all of your functions are supposed to be doing. (Good coders always detail what their code is doing inside the script. Not just for others, but for themselves. I guarantee that a few years down the line you'll be wondering what some of that code is doing.) Would you care to explain them, and the overall code flow?
One thing I'm not sure about, for example, is what happens if any of the timing lines in the two files are not the same. Does it compare them in any way?
In any case I'm pretty sure that it could be made more efficient and robust with just a little work. I already see one potential improvement that could simplify the whole thing quite a bit. I just need to understand what's already there first.
I feel you're being a bit tough on our friend. It doesn't claim to be a high quality project...
He had a need, the same as mine BTW, and didn't find a solution. He wrote a piece of code on the corner of a table (which works, BTW) and just shared it...
I say thanks. It saved me the time of coding the same thing. Can we do better? For sure. But he could also have kept the code to himself...
I'm not trying to be harsh. I'm just pointing out that it pays to be explicit in both coding and internet posting. My main desire was to offer advice on how to improve the script, but I found that I couldn't effectively do that because of the difficulty I had in even understanding it. It's very tiring and frustrating having to trudge through a complex, code-only script like this and try to interpret what it does, when a few simple comments could save so much time and trouble for everyone (including the OP).
Anyway, while I do agree that it's a commendable effort, and that it generally gets the job done, it does suffer from a very large flaw that makes the subsequent code ten times more complex than it needs to be. Specifically it comes down to these two lines:
Anyway, I took an interest in this (for some reason), and actually spent several hours writing my own version of the script. It avoids a lot of the previous complexity and errors, and makes it shorter, more stable, and more efficient. My version doesn't just store the files line-by-line, it actually stores them according to subtitle block, and indexes them according to the entry numbers already existing in the file. This better ensures that the subtitles match between the files.
I also commented it thoroughly to explain what everything is doing.
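For readers who want to picture the block-and-index idea, here is a minimal sketch in Python (not the bash script from this post, whose code is not reproduced here). The function name `parse_srt` and the sample text are made up for illustration, and it assumes well-formed srt input where each block is "number, timing line, text lines" separated by blank lines.

```python
import re

def parse_srt(text):
    """Return {entry_number: (timing_line, [text_lines])} for srt-formatted text."""
    blocks = {}
    # Blocks are separated by one or more blank lines.
    for chunk in re.split(r"\n\s*\n", text.strip()):
        lines = chunk.splitlines()
        # Skip anything that does not start with an entry number and a timing line.
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue
        blocks[int(lines[0])] = (lines[1].strip(), lines[2:])
    return blocks

# Hypothetical sample input.
sample = """1
00:00:01,000 --> 00:00:03,000
Hello

2
00:00:04,000 --> 00:00:06,500
How are you?
I am fine.
"""
blocks = parse_srt(sample)
```

Two files parsed this way can then be looked up by the same entry number, which only helps if both files actually number their subtitles the same way.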
As I mentioned yesterday, the main question I had concerns what should happen if the timing info is different in each file. I decided to just ensure that, for each matching subtitle number, the longest possible time period is kept. I think it may end up causing overlapping titles though. I don't have the time or ability to thoroughly test it offhand.
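A rough Python sketch of the "keep the longest possible time period" rule, under the assumption that both timing lines use the usual zero-padded `HH:MM:SS,mmm --> HH:MM:SS,mmm` form (so the stamps compare correctly as plain strings). The name `widest_span` is hypothetical and this is not the code from the script itself.

```python
def widest_span(timing_a, timing_b):
    """Combine two srt timing lines into one covering the earliest start and latest end."""
    start_a, end_a = timing_a.split(" --> ")
    start_b, end_b = timing_b.split(" --> ")
    # Zero-padded, fixed-width stamps sort chronologically as strings.
    return f"{min(start_a, start_b)} --> {max(end_a, end_b)}"

merged = widest_span("00:00:01,000 --> 00:00:03,000",
                     "00:00:01,200 --> 00:00:03,300")
```

Because the end time can only grow, a widened entry may run past the start of the next one, which is where the overlapping titles mentioned above would come from.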
The only other limitation I know of right now is that, due to one of the features I used, it requires an up-to-date version of bash.
Anyway, give it a try if you'd like:
I tried both scripts with the two attached subtitles (to be renamed to .srt).
While they are for the same movie, relying on the index might be completely wrong.
The English version starts with a 'downloaded from....' entry and the French one does not, so there is already an offset of 1.
English index 3 is translated into two indexes in the French file: 2 and 3. By chance, this cancels the first offset....
(Yes, French is a bit more verbose than English, especially when it's a translation.)
So I think we must rely on time. But the times are not exactly the same...
Let's take a simple example
en 00:01.00 => 00:03.00 Hello
fr 00:01.20 => 00:03.30 Bonjour
First strategy: when overlapping occurs, stop the current title and start a new one with both titles merged.
Quite simple, stable, and it supports one title translated into two titles, but maybe not comfortable for reading.
00:01.00 => 00:01.20 Hello
00:01.20 => 00:03.00 Hello / Bonjour
00:03.00 => 00:03.30 Bonjour
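The first strategy can be sketched in a few lines of Python. This is only an illustration under assumptions: times are plain integers in milliseconds, titles are (start, end, text) tuples, and the name `split_on_overlap` is made up.

```python
def split_on_overlap(cues):
    """cues: list of (start_ms, end_ms, text). Return non-overlapping segments."""
    # Every start or end time is a boundary where the set of visible titles changes.
    bounds = sorted({t for start, end, _ in cues for t in (start, end)})
    segments = []
    for left, right in zip(bounds, bounds[1:]):
        # Titles that are visible for the whole slice, kept in input order.
        visible = [text for start, end, text in cues
                   if start <= left and end >= right]
        if visible:
            segments.append((left, right, "\n".join(visible)))
    return segments

# The Hello / Bonjour example from above, in milliseconds.
cues = [(1000, 3000, "Hello"), (1200, 3300, "Bonjour")]
segments = split_on_overlap(cues)
```

With the example input this gives three segments: "Hello" alone, then both titles stacked, then "Bonjour" alone, which is exactly the flicker that may make it uncomfortable to read.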
Second strategy : make links between close times. Not sure it'll be reliable.
Maybe we can mix both strategies.
About your code, David: it looks nicer, but I can't read it. Not because of you, but because bash script is unreadable. And honestly, I'm not sure I want to spend effort on this.
If I had to write it, I think I'd go with Python: installed on many distros, portable, and much nicer to read. And there is a library to parse srt...
I appreciate the defense, pierrepoulpe, and you are right: this was a quick project for myself where development was far from the goal. Still, I apologise for the lack of comments.
David the H. I thought I described the desired and functional use of the script thoroughly in the original post. Almost embarrassingly too thorough actually.
That being said I find it interesting from what perspective you chose to view the whole thread..?
More interesting is that your complaints are different from my own regarding said script.
Actually I had just been introduced to IFS and that is why I chose bash. A practical practice.
I am much more comfortable in Python, and if I were so inclined to script it in any language I wanted, it would have been Python.
My biggest complaint is how long the damn thing takes. I blame piping to bc. This would have been easily avoided in a Python script, as Python handles floating point numbers like a song, but again, I was restricting myself for purposes of experimentation.
This pipe was necessary though in regard to your one major worry.
I realised very quickly that the saints that make these subtitles determine when they believe a subtitle should start and stop and care little for how other people subbing other languages agree or disagree.
My solution was to give the starting time stamp a numerical value. I worked to come up with some schema that would allow for it to be only an integer... again self restricting... but was allowing this aimless endeavor to be far too distracting so I copped out and piped to bc.
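For what it's worth, one integer-only schema is to count milliseconds. Here is a small Python sketch of that idea (not what the bash script does; the name `to_ms` is made up), assuming stamps in the usual `HH:MM:SS,mmm` form.

```python
def to_ms(stamp):
    """Convert an srt time stamp such as '00:01:02,500' to integer milliseconds."""
    clock, millis = stamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)

value = to_ms("00:01:02,500")
```

Everything stays an integer, so no floating point (and no bc) is needed to compare or subtract two stamps.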
From here I could subtract the starting values of one subtitle from another getting a range between them. Then I matched up the smallest ranges assuming if one subtitle is 1 second off from another, but 10 seconds from the next, clearly the subtitle belongs stacked with the first.
This is the most important element of the script. This is the checkHIGH() function. The heart, but it is still imperfect.
Languages have different grammars, so sometimes a sentence will begin stacked perfectly and then the tail of it will show up stacked on the next subtitle. This happens, but it's rare and hardly detracts from the idée complète.
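The smallest-range matching described above could look roughly like this in Python. It is a sketch of the idea only, not a port of checkHIGH(); the function name and the sample start times (integers in milliseconds) are invented for the example.

```python
def pair_by_nearest_start(foreign_starts, native_starts):
    """For each native start time, return the index of the closest foreign start time."""
    pairs = []
    for native in native_starts:
        # Pick the foreign subtitle whose start is the smallest distance away.
        best = min(range(len(foreign_starts)),
                   key=lambda i: abs(foreign_starts[i] - native))
        pairs.append(best)
    return pairs

# Hypothetical start times in milliseconds.
foreign = [1000, 11000, 20000]
native = [2000, 10500, 21000]
pairs = pair_by_nearest_start(foreign, native)
```

Note that nothing stops two native subtitles from choosing the same foreign one, which matches the "tail stacked on the next subtitle" behaviour mentioned above.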
It has been some time since I wrote this thing, but if I remember correctly those cat`d arrays do fill by line. I first declared IFS as
I have used it quite a bit and I like it as a teaching method. I have thought on how to improve it. One idea was to parse out the verbs from the sentences and put the whole conjugation of the verb in the upper left hand corner.
If I were to take it to such a greater elevation of learning aid, then I would certainly do it in such a way as to promote development. In such a case I'd write the damn thing in C for certain. But for now...
Some other cop outs are the termination of looping the dialogue and exiting the program when it is finished. Basically if the time stamped dialogue has more than 17 lines the program will neglect to account for the 18th and beyond. This is bad practice but it was a quick fix where I was confident that it would be okay. Also, to exit the program it checks if there exists an array element three elements from the current position. If so continue, otherwise exit.
Still another major issue is finding any subtitles that match your film. Some subtitles start too soon or too late, and almost hilariously enough I wrote another script to solve this problem a few years ago, so if I run into it, I first run this thing then the other. I could have easily combined the two, but again, it was for personal use. Basically you find out where the first spoken dialogue begins in the film, and then simply find out where it begins in the subtitle and add the necessary time gap to each time stamp. So if the subtitle and the film are 3 seconds apart, you run through all the timestamps adding three seconds. Just a little extra.
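The time-shift step is easy to sketch in Python. This is not the older script mentioned above, just an illustration; `shift_srt` and `STAMP` are made-up names, and it assumes stamps in `HH:MM:SS,mmm` form.

```python
import re

# Matches srt time stamps such as 00:01:02,500.
STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_srt(text, offset_ms):
    """Add offset_ms (may be negative) to every time stamp found in text."""
    def shift(match):
        h, m, s, ms = (int(group) for group in match.groups())
        # Clamp at zero so a negative offset cannot produce a negative time.
        total = max(0, ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms)
        s, ms = divmod(total, 1000)
        m, s = divmod(s, 60)
        h, m = divmod(m, 60)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    # Caveat: this also rewrites anything in the dialogue that looks like a stamp.
    return STAMP.sub(shift, text)

shifted = shift_srt("00:00:58,500 --> 00:01:01,000", 3000)
```

Here the subtitle is three seconds early, so every stamp moves forward by 3000 ms.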
;Set arrays of srt files by line
;Step through the arrays until you find the first, or next time stamp
;Check stamps against each other
;If they `match' loop through the array until the next time stamp. Stacking all lines. This ensures all subtitles are moved even ones with multiple lines of text
;If best `match' is the next native subtitle then place the native dialogue at the bottom of the stack
... Here I am creating a new word...
Recurse : v. to act recursively.
Also, I hear what you are saying about IFS being a single character. I think that perhaps the " --> " was a remnant of a failed attempt whose cleanup was neglected when the program functioned properly. I think I ended up using the whitespace, which was present in the declaration IFS=" --> "; this is like declaring IFS=" " or "-" or "-" or ">" or " ".
I should have corrected it.
A one liner expressing this point.
That being said I still think a multiple character field separator could be useful.
Also, I see someone moved the thread.
I use the programming forum to give and receive programming help or answer and ask programming questions.
This was something much less than a formal programming effort and that is why I originally put it in general.
I merely wanted to give it to the community in the hopes that someone might find it useful and could use it to broaden their ability to communicate.
Ok, then. I apologize and take back some of my criticism. I honestly overlooked the initial IFS setting (even though it was sitting there staring right at me :doh:). Again, a few comments here and there would've cleared everything up quite quickly (I really can't emphasize that enough--I spent quite a long time trying to piece together what was going on in your functions without getting anywhere and ended up getting rather frustrated).
The whole thing really just caught my eye as a scripting exercise. Truthfully, I don't know much about how the srt format is supposed to work, or exactly how you intended to handle all the complexities of merging two different language files. I just wanted to figure out how to solve the basic problem and, in the absence of more detailed requirements, ended up taking the simplest, most obvious approach and assumed the subtitles would mostly line up. It looks like you've spent more time thinking about it than I thought in my initial impression.
Now that I know more about it, if and when I have more time I might try looking through it again to see what can be done to improve it. But it'll have to wait for a while now. ;)
Here is my version in Python...
SRT library needed: http://pypi.python.org/pypi/pysrt
It works with time, supports many languages (2 or more), and is fast.
cin_'s 1st version: 239 lines > David's: 121 lines > mine: 70 lines.
Thanks to python, and its libraries!
It just lacks a few more lines to find close timestamps and merge them.
pierrepoulpe, have you tested your script?
I ran it through and it seemed to misplace the majority of the subtitles; often creating duplicate entries.
Also it fails to honor characters with accents.
I like the title. Submerge.
Yes, it's working for me.
Could you post the subtitles you use as input?
For accents, I hardcoded iso8859-1 as the input encoding, but it may be wrong for your subtitles. I didn't check whether there is a way to determine which encoding a file uses...
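There is no fully reliable way to detect an encoding from the bytes alone, but a common heuristic is to try UTF-8 first and only then fall back to iso8859-1. A minimal Python sketch of that fallback (the function name is made up, and it is not part of the script above):

```python
def decode_subtitles(raw):
    """Decode raw subtitle bytes, trying UTF-8 first and falling back to ISO-8859-1."""
    try:
        # utf-8-sig also drops a leading byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails,
        # but it can silently mis-decode other legacy encodings.
        return raw.decode("iso-8859-1")

from_utf8 = decode_subtitles("été".encode("utf-8"))
from_latin1 = decode_subtitles("été".encode("iso-8859-1"))
```

This works because accented iso8859-1 text is almost never valid UTF-8, so the first attempt fails cleanly instead of producing garbage.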
On the screenshot attached, you can see that at the beginning of the movie, it's a big mess. Original subtitles don't have the same number of items, not synchronized at all, etc...
It explains why there are so many titles on output. But when you see it on the movie... it's not so bad, almost ok.