Bash :: Dual Language Subtitles

cin_ · 06-03-2012, 09:35 AM

I wrote a quick script to aid in learning a foreign language.
I noticed that my obstacles in learning another language were parsing spoken words, pronunciation, and sentence syntax rather than vocabulary and conjugation.

When thinking on how best to tackle these issues I thought of film. If I could have both the spoken language and my native language subtitles on the screen I could use the spoken language subtitles as reference to parse the syllables I am hearing to their individual words, and then use my native language as reference for new vocabulary.

I often watch movies multiple times purely for enjoyment's sake so watching a film like this adds the extra element of learning.

After a few viewings, I begin just jumping from scene to scene, and recently I have been choosing one character from a dialogue and speaking their parts in response as if I were being spoken to.

All this, and I'm watching one of my favourite films.

Anyway, I'm digging it. Thought I'd share it with you all...

Code:

#!/bin/bash

IFS=$'\012'  # IFS="\n"

echo "FILENAME of your language subtitle: "
read MYL_SUB
echo "FILENAME of foreign language subtitle: "
read FOR_SUB

ARR_MYL=(`cat $MYL_SUB`)
ARR_FOR=(`cat $FOR_SUB`)


# Remove any previous output; -f keeps rm quiet if the file does not exist yet.
rm -f DBLang.srt


NEX_FOR=0
ARR_POS=-1
CEASE=0
TICK=0



currLineFOR()
{
	ARR_POS_FOR=$NEX_FOR
	CEASE_FOR=0
	FIRST_FOR=0
	while [ $FIRST_FOR -eq 0 ];do
		VAL_FIR_FOR=${ARR_FOR[$ARR_POS_FOR]}
		VAL_FIR3_FOR=${VAL_FIR_FOR:2:1}
		if [ "$VAL_FIR3_FOR" == ":" ]; then
			let FIRST_FOR=1
		else
			let ARR_POS_FOR=ARR_POS_FOR+1	
		fi
	
	done
	
	CUR_LIN_FOR=${ARR_FOR[$ARR_POS_FOR]}
	IFS=":"
	TIM_GAP_FOR=${CUR_LIN_FOR:1:2}
	TIM_GAP_FOR=($CUR_LIN_FOR)
	IFS=" --> "
	arrOW_FOR=(${TIM_GAP_FOR[2]})
	IFS=":"
	
	
	CURR_FOR_HR=$((10#${TIM_GAP_FOR[0]}))
	CURR_FOR_MN=$((10#${TIM_GAP_FOR[1]}))
	CURR_FOR_SC=$((10#${arrOW_FOR[0]:0:2}))
	
	unset IFS
}


nextLineFOR() 
{
	CEASE=0
	NEX_FOR=$(( $ARR_POS_FOR+1 ))
	while [ $CEASE_FOR -eq 0 ] && [ $NEX_FOR -lt $(( $ARR_POS_FOR+17)) ]; do
		VAL_NEX_FOR=${ARR_FOR[$NEX_FOR]}
		VAL_NEX3_FOR=${VAL_NEX_FOR:2:1}
		if [ "$VAL_NEX3_FOR" == ":" ]; then
			let CEASE_FOR=CEASE_FOR+1
			DIF_VAL_FOR=$(( $NEX_FOR-$ARR_POS_FOR ))
		else
			let NEX_FOR=NEX_FOR+1
		fi
	done
	
	NEX_LIN_FOR=${ARR_FOR[$NEX_FOR]}
	IFS=":"
	TIM_NEX_FOR=${NEX_LIN_FOR:1:2}
	TIM_NEX_FOR=($NEX_LIN_FOR)
	IFS=" --> "
	arrOW_FOR_NEX=(${TIM_NEX_FOR[2]})
	IFS=":"
	
	unset IFS
}




currLineFOR
nextLineFOR




nextLine()
{
	CEASE=0
	NEX=$(( $ARR_POS+1 ))
	while [ $CEASE -eq 0 ] && [ $NEX -lt $(( $ARR_POS+17)) ]; do
		VAL_NEX=${ARR_MYL[$NEX]}
		VAL_NEX3=${VAL_NEX:2:1}
		if [ "$VAL_NEX3" == ":" ]; then
			let CEASE=CEASE+1
			DIF_VAL=$(( $NEX-$ARR_POS ))
		else
			let NEX=NEX+1
		fi
	done
	
	NEX_LIN=${ARR_MYL[$NEX]}
	IFS=":"
	TIM_GAP_NEX=${NEX_LIN:1:2}
	TIM_GAP_NEX=($NEX_LIN)
	IFS=" --> "
	arrOW_GAP_NEX=(${TIM_GAP_NEX[2]})
	IFS=":"
	
	CURR_NEX_HR=$((10#${TIM_GAP_NEX[0]}))
	CURR_NEX_MN=$((10#${TIM_GAP_NEX[1]}))
	CURR_NEX_SC=$((10#${arrOW_GAP_NEX[0]:0:2}))
}




loopLineFOR()
{
	CHECK=1
	let DIF_VAL_FOR=DIF_VAL_FOR-2
	while [ $CHECK -lt $DIF_VAL_FOR ]; do
		echo ${ARR_FOR[$(( $ARR_POS_FOR+$CHECK ))]}>>DBLang.srt
		let CHECK=CHECK+1
	done
}




loopLine()
{
	CHECK=1
	let DIF_VAL=DIF_VAL-2
	while [ $CHECK -lt $DIF_VAL ]; do
		echo ${ARR_MYL[$(( $ARR_POS+$CHECK ))]}>>DBLang.srt
		let CHECK=CHECK+1
	done
}




################################################
################################################

checkHIGH()
{
	CURR_LIN_HR=$((10#${TIM_GAP[0]}))
	CURR_LIN_MN=$((10#${TIM_GAP[1]}))
	CURR_LIN_SC=$((10#${arrOW[0]:0:2}))


	CURR_NEX_HR=$((10#${TIM_GAP_NEX[0]}))
	CURR_NEX_MN=$((10#${TIM_GAP_NEX[1]}))
	CURR_NEX_SC=$((10#${arrOW_GAP_NEX[0]:0:2}))
	
	
	CURR_FOR_HR=$((10#${TIM_GAP_FOR[0]}))
	CURR_FOR_MN=$((10#${TIM_GAP_FOR[1]}))
	CURR_FOR_SC=$((10#${arrOW_FOR[0]:0:2}))
	
	GAP_CUR_FOR_HR=$(( $CURR_LIN_HR-$CURR_FOR_HR ))
	GAP_NEX_FOR_HR=$(( $CURR_NEX_HR-$CURR_FOR_HR ))
	
	if [ $CURR_LIN_HR == 1 ]; then
		CURR_LIN_DEC=`echo -e "scale=2; ($CURR_LIN_SC/60)+$CURR_LIN_MN+60" | bc`
	else
		CURR_LIN_DEC=`echo -e "scale=2; ($CURR_LIN_SC/60)+$CURR_LIN_MN" | bc`
	fi
		
	if [ $CURR_FOR_HR == 1 ]; then
		CURR_FOR_DEC=`echo -e "scale=2; ($CURR_FOR_SC/60)+$CURR_FOR_MN+60" | bc`
	else
		CURR_FOR_DEC=`echo -e "scale=2; ($CURR_FOR_SC/60)+$CURR_FOR_MN" | bc`
	fi	
	
	if [ $CURR_NEX_HR == 1 ]; then
		CURR_NEX_DEC=`echo -e "scale=2; ($CURR_NEX_SC/60)+$CURR_NEX_MN+60" | bc`
	else
		CURR_NEX_DEC=`echo -e "scale=2; ($CURR_NEX_SC/60)+$CURR_NEX_MN" | bc`
	fi


	GAP_CUR_FOR_DC=`echo -e "scale=2; ($CURR_LIN_DEC-$CURR_FOR_DEC)*100" | bc`
	GAP_CUR_FOR_IN=`echo -e "scale=0; $GAP_CUR_FOR_DC/1" | bc`
	
	GAP_NEX_FOR_DC=`echo -e "scale=2; ($CURR_NEX_DEC-$CURR_FOR_DEC)*100" | bc`
	GAP_NEX_FOR_IN=`echo -e "scale=0; $GAP_NEX_FOR_DC/1" | bc`
	
	

	if [ ${GAP_CUR_FOR_IN#-} -lt ${GAP_NEX_FOR_IN#-} ]; then
		if [ -z "${ARR_FOR[$(( $ARR_POS_FOR+3 ))]}" ];then
			echo -e "\ni7h43"
			exit
		fi
		loopLineFOR
		currLineFOR
		nextLineFOR
		checkHIGH
	else
		loopLine
		echo -ne "    $CURR_LIN_HR:$CURR_LIN_MN:$CURR_LIN_SC    \r"
		let TICK=TICK+1
	fi
}


################################################
################################################



IFS=$'\012'  # IFS="\n"
for CUR_LIN in ${ARR_MYL[@]}
do
	let ARR_POS=ARR_POS+1
	LIN_TRD=${CUR_LIN:2:1}
	if [ "$LIN_TRD" == ":" ];then
		IFS=":"
		TIM_GAP=${CUR_LIN:1:2}
		TIM_GAP=($CUR_LIN)
		IFS=" --> "
		arrOW=(${TIM_GAP[2]})
		IFS=":"
		
		nextLine
		echo -e "\n$TICK\n$CUR_LIN">>DBLang.srt
		checkHIGH
	fi
done
unset IFS
exit

... note that the subtitles you download have to be of this format:

Quote:

1
00:01:59,980 --> 00:02:01,680
What are you reading?

... I looked around and this seems to be the most prevalent format so it is the one I used. The arrow is a necessity.

kbp · 06-04-2012, 09:32 PM

I haven't tried it yet but it looks like a great idea, well done.

pierrepoulpe · 06-09-2012, 03:41 PM

Wonderful!
Exactly what I was looking for. The funniest part is that I didn't expect to find a solution for Linux, and had accepted that I would have to run some Win32 software in a VM.

I have exactly the same problem as you: if I don't read the English subtitles (in the case of an English-language movie), I can't make out (= parse) the words.
And sometimes there are words I don't know, so having my native language (French) right beside them is perfect.

BTW, it worked like a charm.

Thanks,

cin_ · 06-12-2012, 12:50 AM

pierrepoulpe, wow, thanks for the support. I am glad you found this and are putting it to good use.

Out of curiosity how did you come across this post? If you came from Search what was your wording for the search so I can make the post more search friendly.

pierrepoulpe · 06-12-2012, 02:25 AM

I think I googled "dual language subtitle" or "subtitles dual language", something like that, which is almost the title of the thread...

A typical hard-to-find thing: you don't know which keywords a potential author may have used.

David the H. · 06-12-2012, 12:22 PM

I'm finding your post highly confusing.

First of all, you never explained exactly what the script is designed to do. I had to try it out for myself to discover what it does (it appears to simply combine two srt-format subtitle files into one that displays both languages, BTW).

I also see some coding errors and other weak scripting points. This, for example:

Code:

IFS=" --> "

IFS doesn't work like that. It can't be used to define a multi-character delimiting string. It only treats each (and every) character in it as an individual delimiter.
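As a minimal sketch of that behaviour (the sample line and the names `parts`, `same`, `start` and `end` are made up for illustration, not taken from the script): `IFS=" --> "` splits exactly like `IFS=" ->"`, and parameter expansion is one way to cut on the literal arrow instead.

```shell
#!/usr/bin/env bash
# Illustrative sketch; the variable names are assumptions, not from the script.
line='00:01:59,980 --> 00:02:01,680'

# Every character in IFS is its own delimiter, so these two splits are identical.
# Prefixing read with IFS=... limits the change to that one command.
IFS=' --> ' read -r -a parts <<< "$line"
IFS=' ->'   read -r -a same  <<< "$line"
printf 'first field: %s (%d fields either way)\n' "${parts[0]}" "${#parts[@]}"

# To split on the literal string " --> ", strip with parameter expansion instead.
start=${line%% --> *}	# drop everything from the arrow onwards
end=${line##* --> }	# drop everything up to and including the arrow
printf '%s | %s\n' "$start" "$end"
```

Both `read` calls fill their arrays the same way, while the two expansions return the start and end stamps without touching IFS at all.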

I'd like to try my hand at making modifications and fixes to the script, but since you failed to include any comment lines, I can't quite figure out what all of your functions are supposed to be doing. (Good coders always detail what their code is doing inside the script. Not just for others, but for themselves. I guarantee that a few years down the line you'll be wondering what some of that code is doing.) Would you care to explain them, and the overall code flow?

One thing I'm not sure about, for example, is what happens if any of the timing lines in the two files are not the same. Does it compare them in any way?

In any case I'm pretty sure that it could be made more efficient and robust with just a little work. I already see one potential improvement that could simplify the whole thing quite a bit. I just need to understand what's already there first.

pierrepoulpe · 06-12-2012, 01:39 PM

I feel you are being a bit tough on our friend. It was never claimed to be a high-quality project...
He had a need, the same as mine BTW, and didn't find a solution. He wrote a piece of code on the corner of a table (and it works, btw) and simply shared it...

I say thanks. It saved me the time of coding the same thing. Can we do better? For sure. But he could also have kept the code to himself...

David the H. · 06-13-2012, 12:36 PM

I'm not trying to be harsh. I'm just pointing out that it pays to be explicit in both coding and internet posting. My main desire was to offer advice on how to improve the script, but I found that I couldn't effectively do that because of the difficulty I had in even understanding it. It's very tiring and frustrating having to trudge through a complex, code-only script like this and try to interpret what it does, when a few simple comments could save so much time and trouble for everyone (including the OP).

Anyway, while I do agree that it's a commendable effort, and that it generally gets the job done, it does suffer from a very large flaw that makes the subsequent code ten times more complex than it needs to be. Specifically it comes down to these two lines:

Code:

ARR_MYL=(`cat $MYL_SUB`)
ARR_FOR=(`cat $FOR_SUB`)

The use of cat in this way sets two arrays that store one entry for each word in the files. This means that the majority of the following code is there simply to reassemble everything back into lines again. Why? If the code had instead been designed to store the files as one line per array entry in the first place, a whole lot of effort could have been avoided.

Anyway, I took an interest in this (for some reason), and actually spent several hours writing my own version of the script. It avoids a lot of the previous complexity and errors, and makes it shorter, more stable, and more efficient. My version doesn't just store the files line by line, it actually stores them according to subtitle block, and indexes them according to the entry numbers already existing in the file. This better ensures that the subtitles match between the files.

I also commented it thoroughly to explain what everything is doing.

As I mentioned yesterday, the main question I had concerns what should happen if the timing info is different in each file. I decided to just ensure that, for each matching subtitle number, the longest possible time period is kept. I think it may end up causing overlapping titles though. I don't have the time or ability to thoroughly test it offhand.

The only other limitation I know of right now is that, due to one of the features I used, it requires an up-to-date version of bash.

Anyway, give it a try if you'd like:

Code:

#!/bin/bash

# This script merges two srt language subtitle files into one that displays
# both subtitles at once.  If the display times are different between the files
# it will use the longest value.

# If no arguments are given print usage message and exit.
if [[ -z $1 ]]; then
	echo "Merges two .srt files into one."
	echo
	echo "Usage: ${0##*/} <infile1> <infile2> [outfile]"
	echo "If no output file is given, it will print to stdout."
	echo
	echo "Requires bash v.4.2 or higher."
	exit 1
fi >&2


# Set the output to a file descriptor that goes to the filename given
# as parameter $3, or to stdout if not supplied.
if [[ -n $3 ]]; then
	exec 6>"$3"
else
	exec 6>&1
fi

shopt -s extquote	# Needed to handle newlines and carriage returns inside parameters.
LANG=C			# Ensure that the locale uses straight ascii, due to string comparisons used.

# The readfile function reads the input srt files and sets two matching arrays
# from them (time and dialog). The actual array names it uses are passed to it
# during execution, while the indexes of the arrays are taken from the files entries.
# This function requires a recent version of bash (4.2+), because it sets dynamic
# array names with printf -v, and older versions don't accept array elements
# in that option.
# $1 is the filename, $2 is the time array name, and $3 is the dialog array name.
readfile() {

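	# Set local regexes to match the index line and the time line, plus local variables.
	# lastline is local too, so state left over from the first file cannot leak into the second call.
	local re1='^([0-9]+)$' re2='[:].*-->.*[:]' line idx dlg lastline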

	# Loop through the input file.
	while read -r line || [[ -n $line ]]; do

		line=${line%$'\r'}	# Remove any trailing dos carriage returns.

		if [[ -z $line ]]; then		# Reset dlg on blank lines.

			dlg=""
			lastline=$line
			continue

		elif [[ -z $lastline && $line =~ $re1 ]]; then	# If the line contains only a number (and was
								# preceded by a blank line to avoid false
								# positives), use it as the next array index.
			idx=${BASH_REMATCH[1]}			
			lastline=$line
			continue
		
		elif [[ $line =~ $re2 ]]; then	# If the line matches the time regex, add it to the tm array.
			
			printf -v "$2[idx]" '%s' "$line"
		
		else				# All other lines get concatenated with newlines between them.
			
			dlg="${dlg:+$dlg$'\n'}$line"
		
		fi
		
		# Set the current dlg array index to the value of dlg.
		printf -v "$3[idx]" '%s' "$dlg"
		lastline=$line

	done <"$1"

}

# Call the file-reading function, and pass it the names of the arrays.
readfile "$1" lang1tm lang1dlg
readfile "$2" lang2tm lang2dlg

# See if the two files have an equal number of indexes,
# and print a warning in such cases.
if (( ${#lang1tm[@]} != ${#lang2tm[@]} )); then
	echo
	echo "=================================================================="
	echo "Warning!  The two files do not contain an equal number of indexes."
	echo              "The lines may not merge correctly."
	echo "=================================================================="
	echo
fi >&2

# Get the longest array value.
(( ln = ${#lang1tm[@]} > ${#lang2tm[@]} ? ${#lang1tm[@]} : ${#lang2tm[@]} )) 

# Loop through the arrays.
for (( i=1; i<=ln; i++ )); do

	# Compare start and end times from the two arrays.
	# Simple string comparison turns out to be adequate in this case.
	# Use the lowest start time and the highest end time.
	# In addition, if one array has no entry, use the value of the other.

	start1=${lang1tm[i]%% *}
	end1=${lang1tm[i]##* }
	start2=${lang2tm[i]%% *}
	end2=${lang2tm[i]##* }

	[[ $start1 < $start2 ]] && start=${start1:-$start2} || start=${start2:-$start1}
	[[ $end1 > $end2 ]] && end=${end1:-$end2} || end=${end2:-$end1}

	# Format and print the completed entry to file (or stdout).
	# Note that the output uses unix newlines.  replace \n with \r\n if dos format is required.
	printf "%s\n" "$i"
	printf "%s --> %s\n" "$start" "$end"
	printf "%s\n" "${lang1dlg[i]}"
	printf "%s\n\n" "${lang2dlg[i]}"

done >&6

exit 0

pierrepoulpe · 06-13-2012, 06:16 PM

Hello,

I tried both scripts with the two attached subtitle files (to be renamed to .srt).

Although they are for the same movie, relying on the index can be completely wrong.
The English version starts with a 'downloaded from....' entry and the French one doesn't: already an offset of 1.
English index 3 is translated into two entries in the French file: 2 and 3. By chance, this cancels the first offset....
(Yes, French is a bit more verbose than English, especially when it's a translation.)

So I think we must rely on time. But the times are not exactly the same...

Let's take a simple example
en 00:01.00 => 00:03.00 Hello
fr 00:01.20 => 00:03.30 Bonjour

First strategy: when an overlap occurs, stop the current title and start a new one with both titles merged.
Quite simple, stable, and it supports one title translated into two titles, but it may not be comfortable to read.

00:01.00 => 00:01.20 Hello

00:01.20 => 00:03.00 Hello
Bonjour

00:03.00 => 00:03.30 Bonjour
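
A rough bash sketch of this first strategy, to make the splitting concrete. Everything in it is hypothetical (the `cues` and `bounds` arrays, the `split_cues` function), and it assumes the example times read as 1.00 s, 1.20 s and so on, stored here as plain milliseconds.

```shell
#!/usr/bin/env bash
# Sketch of the "split on overlap" idea; all names here are hypothetical.
# Each cue is "start_ms end_ms text".
cues=(
	"1000 3000 Hello"
	"1200 3300 Bonjour"
)

# Collect every start and end time, sorted and de-duplicated.
mapfile -t bounds < <(
	for c in "${cues[@]}"; do
		read -r s e _ <<< "$c"
		printf '%s\n%s\n' "$s" "$e"
	done | sort -n -u
)

# For each pair of neighbouring boundaries, list the cues active in between.
split_cues() {
	local i from to c s e text active
	for (( i = 0; i < ${#bounds[@]} - 1; i++ )); do
		from=${bounds[i]}
		to=${bounds[i+1]}
		active=()
		for c in "${cues[@]}"; do
			read -r s e text <<< "$c"
			(( s <= from && e >= to )) && active+=("$text")
		done
		# Skip gaps where nothing is on screen.
		(( ${#active[@]} )) && printf '%s-%s: %s\n' "$from" "$to" "${active[*]}"
	done
	return 0
}

split_cues
```

For the two cues above this prints three segments: Hello alone, then both titles together, then Bonjour alone. It rescans every cue for every segment, which is fine for a demonstration but would be slow on a full film.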

Second strategy : make links between close times. Not sure it'll be reliable.

Maybe we can mix both strategies.

About your code, David: it looks nicer, but I can't read it. Not because of you, but because bash script is unreadable. And honestly, I'm not sure I want to spend effort on this.
If I had to write it, I think I'd go with Python: installed on many distros, portable, and much nicer to read. And there is a library to parse srt...
http://pypi.python.org/pypi/pysrt/0.2.2

cin_ · 06-14-2012, 05:44 AM

Interesting response...

I appreciate the defense pierrepoulpe and you are right, this was a quick project for myself where development was far from the goal. Yet, I apologise for the lacking comments.

David the H. I thought I described the desired and functional use of the script thoroughly in the original post. Almost embarrassingly too thorough actually.

That being said I find it interesting from what perspective you chose to view the whole thread..?
More interesting is that your complaints are different than my own in regard said script.

Actually I had just been introduced to IFS and that is why I chose bash. A practical practice.
I am much more comfortable in python, and if I were so inclined to script it in any language I wanted, it would have been python.

My biggest complaint is how long the damn thing takes. I blame piping to bc. This would have been easily avoided in a python script as python handles floating point numbers like a song, but again, was self restricting for purposes of experimentation.

This pipe was necessary though in regard to your one major worry.
Matching lines.

I realised very quickly that the saints that make these subtitles determine when they believe a subtitle should start and stop and care little for how other people subbing other languages agree or disagree.

My solution was to give the starting time stamp a numerical value. I worked to come up with some schema that would allow for it to be only an integer... again self restricting... but was allowing this aimless endeavor to be far too distracting so I copped out and piped to bc.

From here I could subtract the starting values of one subtitle from another getting a range between them. Then I matched up the smallest ranges assuming if one subtitle is 1 second off from another, but 10 seconds from the next, clearly the subtitle belongs stacked with the first.
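
A hedged sketch of an integer-only schema, using bash arithmetic instead of bc. The function names `ts_to_ms` and `gap_ms` and the sample stamps are invented for illustration. It also handles any hour value, whereas checkHIGH as posted adds 60 minutes only when the hour field is exactly 1.

```shell
#!/usr/bin/env bash
# Sketch: function and variable names are hypothetical, not from the script.

# Convert an SRT timestamp (HH:MM:SS,mmm) to integer milliseconds.
# The 10# prefix forces base 10 so that "08" and "09" are not read as octal.
ts_to_ms() {
	local h m s ms
	IFS=':,' read -r h m s ms <<< "$1"
	echo $(( (10#$h * 3600 + 10#$m * 60 + 10#$s) * 1000 + 10#$ms ))
}

# Absolute gap between two timestamps, in milliseconds.
gap_ms() {
	local a b d
	a=$(ts_to_ms "$1")
	b=$(ts_to_ms "$2")
	d=$(( a - b ))
	echo $(( d < 0 ? -d : d ))
}

foreign='00:02:00,100'
native_cur='00:01:59,980'
native_next='00:02:10,500'

# Stack the foreign line with whichever native line starts closer to it.
if (( $(gap_ms "$native_cur" "$foreign") <= $(gap_ms "$native_next" "$foreign") )); then
	echo "stack with current"
else
	echo "stack with next"
fi
```

With these sample stamps the gaps are 120 ms and 10400 ms, so the foreign line stacks with the current native one. Each `$(...)` still forks a subshell, so this avoids the external bc process but is not guaranteed to be fast on a long file.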

This is the most important element of the script. This is the checkHIGH() function. The heart, but it is still imperfect.

Languages have different grammars so sometimes a sentence will begin stacked perfectly and then the tail of it will show up stacked on the next subtitle. This happens, but it's rare and hardly detracts from the idée complète.

It has been some time since I wrote this thing, but if I remember correctly those cat`d arrays do fill by line. I first declared IFS as

Code:

IFS=$'\012'  # IFS="\n"

ensuring that each element terminated at a newline.
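
For comparison, a small sketch of the newline-IFS approach next to bash 4's `mapfile` builtin. The sample file contents and the names `by_ifs` and `by_mapfile` are made up for illustration. One difference worth knowing: the IFS split drops blank lines, while `mapfile` keeps them as empty elements, so array positions would not match the script above.

```shell
#!/usr/bin/env bash
# Sketch only: the sample file and variable names are made up for illustration.
tmp=$(mktemp)
printf '1\n00:01:59,980 --> 00:02:01,680\nWhat are you reading?\n\n' > "$tmp"

# Approach from the script: split command output on newlines only.
# Blank lines vanish because newline counts as IFS whitespace.
set -f			# keep "?" and "*" in dialogue from being glob-expanded
IFS=$'\n'
by_ifs=( $(cat "$tmp") )
unset IFS		# back to default splitting
set +f

# mapfile (bash 4+): one element per line, -t strips the trailing newline.
mapfile -t by_mapfile < "$tmp"

echo "IFS split: ${#by_ifs[@]} elements"
echo "mapfile:   ${#by_mapfile[@]} elements"
rm -f "$tmp"
```

Here the IFS split yields three elements and `mapfile` yields four, the fourth being the empty string for the blank separator line.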

I have used it quite a bit and I like it as a teaching method. I have thought on how to improve it. One idea was to parse out the verbs from the sentences and put the whole conjugation of the verb in the upper left hand corner.
If I were to take it to such a greater elevation of learning aid, then I would certainly do it in such a way as to promote development. In such a case I'd write the damn thing in C for certain. But for now...

Some other cop outs are the termination of looping the dialogue and exiting the program when it is finished. Basically if the time stamped dialogue has more than 17 lines the program will neglect to account for the 18th and beyond. This is bad practice but it was a quick fix where I was confident that it would be okay. Also, to exit the program it checks if there exists an array element three elements from the current position. If so continue, otherwise exit.

Code:

i7h43

is my personal exit notifier. Just is. Again poor design but I can cop to that.

Still another major issue is finding any subtitles that match your film. Some subtitles start too soon or too late, and almost hilariously enough I wrote another script to solve this problem a few years ago, so if I run into it, I first run this thing then the other. I could have easily combined the two, but again, it was for personal use. Basically you find out where the first spoken dialogue begins in the film, and then simply find out where it begins in the subtitle and add the necessary time gap to each time stamp. So if the subtitle and the film are 3 seconds apart, you run through all the timestamps adding three seconds. Just a little extra.
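
The other script is not posted here, so the following is only a guess at the idea, not that script: a minimal bash sketch that adds a fixed offset to every time line. The names `ts_to_ms`, `ms_to_ts` and `shift_srt` are hypothetical, and it assumes any line containing ` --> ` is a time line.

```shell
#!/usr/bin/env bash
# Sketch: names are hypothetical; this is not the author's other script.

# HH:MM:SS,mmm -> integer milliseconds (10# avoids octal on "08"/"09").
ts_to_ms() {
	local h m s ms
	IFS=':,' read -r h m s ms <<< "$1"
	echo $(( (10#$h * 3600 + 10#$m * 60 + 10#$s) * 1000 + 10#$ms ))
}

# Integer milliseconds -> HH:MM:SS,mmm (negative results are clamped to zero).
ms_to_ts() {
	local t=$1
	(( t < 0 )) && t=0
	printf '%02d:%02d:%02d,%03d\n' \
		$(( t / 3600000 )) $(( t / 60000 % 60 )) $(( t / 1000 % 60 )) $(( t % 1000 ))
}

# Read SRT text on stdin, add $1 milliseconds to every time line, write to stdout.
shift_srt() {
	local offset=$1 line start end
	while IFS= read -r line || [[ -n $line ]]; do
		if [[ $line == *' --> '* ]]; then
			start=${line%% --> *}
			end=${line##* --> }
			end=${end%$'\r'}	# tolerate DOS line endings
			start=$(ms_to_ts $(( $(ts_to_ms "$start") + offset )))
			end=$(ms_to_ts $(( $(ts_to_ms "$end") + offset )))
			line="$start --> $end"
		fi
		printf '%s\n' "$line"
	done
}

printf '1\n00:01:59,980 --> 00:02:01,680\nWhat are you reading?\n' | shift_srt 3000
```

Run as shown, the time line comes out as `00:02:02,980 --> 00:02:04,680` and the other lines pass through unchanged. A negative offset (for example `shift_srt -3000`) moves the subtitles earlier.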

...
Some comment`ary
;Set arrays of srt files by line
;Step through the arrays until you find the first, or next time stamp
;Check stamps against each other
;If they `match' loop through the array until the next time stamp. Stacking all lines. This ensures all subtitles are moved even ones with multiple lines of text
;If best `match' is the next native subtitle then place the native dialogue at the bottom of the stack
;recurse

... Here I am creating a new word...
Recurse : v. to act recursively.
hazzah

Also, I hear what you are saying about IFS being a single character. I think that perhaps the " --> " was a remnant of a failed attempt whose cleanup was neglected when the program functioned properly. I think I ended up using the whitespace, which was present in the declaration IFS=" --> "; this is like declaring IFS=" " or "-" or "-" or ">" or " ".
I should have corrected it.

A one liner expressing this point.

Code:

# IFS="shit"; echo "hello shitty husband">FIZL; ARR=(`cat FIZL`); echo ${ARR[0]} .. ${ARR[1]} .. ${ARR[2]} .. ${#ARR[@]}; rm FIZL
.. ello  .. .. 9
# IFS=" shit "; echo "hello shitty husband">FIZL; ARR=(`cat FIZL`); echo ${ARR[0]} .. ${ARR[1]} .. ${ARR[2]} .. ${#ARR[@]}; rm FIZL
.. ello .. .. 9
# IFS=" "; echo "hello shitty husband">FIZL; ARR=(`cat FIZL`); echo ${ARR[0]} .. ${ARR[1]} .. ${ARR[2]} .. ${#ARR[@]}; rm FIZL
hello .. shitty .. husband .. 3
# IFS=" s "; echo "hello shitty husband">FIZL; ARR=(`cat FIZL`); echo ${ARR[0]} .. ${ARR[1]} .. ${ARR[2]} .. ${#ARR[@]}; rm FIZL
hello .. hitty .. hu .. 4
# IFS="s"; echo "hello shitty husband">FIZL; ARR=(`cat FIZL`); echo ${ARR[0]} .. ${ARR[1]} .. ${ARR[2]} .. ${#ARR[@]}; rm FIZL
hello  .. hitty hu .. band .. 3

... I wrote the sentence to file then cat`d to an array to mimic the script as best I could.

That being said I still think a multiple character field separator could be useful.

cin_ · 06-14-2012, 04:33 PM

Also, I see someone moved the thread.
That's fine.

I use the programming forum to give and receive programming help or answer and ask programming questions.
This was something much less than a formal programming effort and that is why I originally put it in general.

I merely wanted to give it to the community in the hopes that someone might find it useful and could use it to broaden their ability to communicate.

David the H. · 06-14-2012, 07:30 PM

Ok, then. I apologize and take back some of my criticism. I honestly overlooked the initial IFS setting (even though it was sitting there staring right at me). Again, a few comments here and there would've cleared everything up quite quickly (I really can't emphasize that enough--I spent quite a long time trying to piece together what was going on in your functions without getting anywhere and ended up getting rather frustrated).

The whole thing really just caught my eye as a scripting exercise. Truthfully, I don't know much about how the srt format is supposed to work, or exactly how you intended to handle all the complexities of merging two different language files. I just wanted to figure out how to solve the basic problem and, in the absence of more detailed requirements, ended up taking the simplest, most obvious approach and assumed the subtitles would mostly line up. It looks like you've spent more time thinking about it than I thought in my initial impression.

Now that I know more about it, if and when I have more time I might try looking through it again to see what can be done to improve it. But it'll have to wait for a while now.

pierrepoulpe · 06-14-2012, 09:05 PM

Here is my version in python...
SRT library needed : http://pypi.python.org/pypi/pysrt

Works with time, supports many languages (2 or more), fast.

cin_'s 1st version: 239 lines > David's: 121 lines > mine: 70 lines.
Thanks to python, and its libraries!

It just lacks a few more lines to find close timestamps and merge them.

Code:

#!/usr/bin/env python
# -*- coding: utf8 -*-
from pysrt import SubRipFile
from pysrt import SubRipItem
import sys

#need 3 arguments at least, (4 because first is the program name itself)
if len(sys.argv) < 4:
	print("usage : ./submerge.py lang1.srt lang2.srt [langn.srt ...] out.srt")
	sys.exit()

#load and parse input subtitles, merge in one list
subs = SubRipFile.open(sys.argv[1], encoding='iso-8859-1')
for i in range(2,(len(sys.argv) - 1)):
	subs.extend(SubRipFile.open(sys.argv[i], encoding='iso-8859-1'))

#sort all languages mixed by start time
subs.sort(key=lambda item:item.start)

#init output file
out = SubRipFile()

#init the buffer : the current displayed subtitles (0..n). Loads the first title to start, and curTime to the first start time
buffer = list()
buffer.append(subs.pop(0))
curTime = buffer[0].start

while len(subs) > 0:
	#init output item
	itemMerge = SubRipItem() 

	#load in the buffer all items starting at the same time
	while len(subs) > 0 and buffer[0].start == subs[0].start:
		buffer.append(subs.pop(0))

	#merge text from all items in buffer
	for item in buffer:
		itemMerge.text = itemMerge.text + item.text + '\n'
	itemMerge.text = itemMerge.text.strip() #remove last \n (strip() returns a new string, so assign it back)

	#output start time
	itemMerge.start = curTime

	#find the first item in buffer that will end	
	firstEnd = min(buffer,key=lambda item:item.end).end

	#if the next title starts before the first buffered item ends, add it to the buffer and end the current output item at the new item's start
	if len(subs) > 0 and subs[0].start < firstEnd:
		curTime = subs[0].start
		buffer.append(subs.pop(0))
		itemMerge.end = curTime

	#else, the output item ends when the first item in the buffer ends
	else:
		curTime = firstEnd
		itemMerge.end = firstEnd

		#we remove from buffer all items ending at the same time
		buffer = [item for item in buffer if item.end > firstEnd]

		#if buffer is empty, we load the next one.
		if len(subs) > 0 and len(buffer) == 0 :
			curTime = subs[0].start
			buffer.append(subs.pop(0))

	#add item to output file
	out.append(itemMerge)

#write output file to disk
out.save(sys.argv[-1])

cin_ · 06-15-2012, 03:30 AM

pierrepoulpe have you tested your script?

I ran it through and it seemed to misplace the majority of the subtitles; often creating duplicate entries.
Also it fails to honor characters with accents.

I like the title. Submerge.

pierrepoulpe · 06-15-2012, 04:42 AM

Yes, it's working for me.

Could you post the subtitles you use as input?
For accents, I hardcoded iso-8859-1 as the input encoding, but it may be wrong for your subtitles. I didn't check whether there is a way to determine a file's encoding..

In the attached screenshot, you can see that at the beginning of the movie it's a big mess. The original subtitles don't have the same number of items, aren't synchronized at all, etc...
It explains why there are so many titles in the output. But when you see it on the movie... it's not so bad, almost OK.