[SOLVED] Need some help with a Bash script

David the H. · 05-04-2010, 10:01 AM

Quote:

Originally Posted by catkin

grail suggested it in the second post of the thread.

Why so he did. In fact, it looks like you could even skip that step entirely and go straight to:

Code:

if [[ ${lineoftext:0:1} == "0" ]]; then
   ...do all the other stuff...
fi
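
As a rough sketch of how that test might sit inside a read loop: the helper name `is_level0` and the sample lines below are invented for illustration and are not taken from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the first character of the line is "0".
is_level0() {
    local lineoftext=$1
    [[ ${lineoftext:0:1} == "0" ]]
}

# Made-up sample input in the same general shape as the file discussed here.
while IFS= read -r lineoftext; do
    if is_level0 "$lineoftext"; then
        printf 'level 0: %s\n' "$lineoftext"
    fi
done <<'EOF'
0 @I1039@ INDI
1 BIRT
2 DATE 1 JAN 1900
0 @F12@ FAM
EOF
```

Only the two lines starting with 0 are printed; everything else is skipped without calling any external command.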

I see other things that can be simplified too. For one thing, there are far too many sed statements. I'm sure they could be condensed to extract the desired strings in a single step. And what's the use of "category=2"? It seems to have no purpose at all in the script.

LUB997 · 05-05-2010, 03:55 AM

Yes, those carriage returns sure are annoying... At least I'll certainly remember that in the future when converting files over from Windows to Linux. My goal is not to have to, and to only use Linux programs and just not use Windows, but unfortunately, there are still about 1 or 2 programs that don't have good Linux equivalents, such as Family Tree Maker, AnyDVD, and Netflix Watch Now. I think that will change though, when you look at all the progress Linux has made over even just the past few years. 5 years ago I dreamed of being able to eliminate Windows and only use Linux, but didn't see it as a reality; today Linux has come so far that I run nothing but Linux on all my computers and just keep Windows XP safely contained in a virtual machine where it can't harm anything to run 1 or 2 programs. Who knows how long it will take before I can even go ahead and delete my virtual machine, but until then, you can bet I'll be remembering to check for carriage returns after this.

Thanks to everyone for all the suggestions! I also really appreciate the comments people made on simplifying the code. Like I said before, although I am experienced with C#, I am definitely not experienced with shell scripting, so things that might make sense to me in C# are not common knowledge to me yet in shell scripting, and I have to just kind of go with the examples I see on the internet to learn how to do it as I go, so any comments about how to do it more simply are very useful. The code has evolved quite a lot in the past few days and is now actually putting out useful output, though it still has a ways to go in order to get it to do exactly what I want it to do, and I have really been noticing how it would be nice to simplify it, so I will keep all of your suggestions in mind.

I believe David asked what the reason for category 2 is... David, you are right; category 2 did not have a purpose YET in the code as it was posted previously. However, I was looking ahead, and knew that it was going to have a purpose, which is why it was there. If you look at the example of the file format I am dealing with, you might notice that each line begins with 0, 1, 2, or very rarely 3. I am never going to use 3 for my purposes, so 3 is not there.

The way the file format works (in general) is that category 0 lines represent something new, such as a new individual or a new family. Category 1 lines identify some type of event about the thing identified by the category 0 line, such as its birth, death, etc. Then, category 2 lines give even more specific details about that category 1 event, such as the event's place or date. I will definitely need category 2 lines in order to read in where people were in each census year, since to sort out who was in a given location in what year, I have to read that information in in the first place, and that will use category 2 lines. Not sure who came up with such an odd file format, but I like it because it wasn't too difficult to figure out how it worked.

For anyone really curious about how family tree files work: if you take a look at each individual, you might also notice that they have a line that says FAMC. FAMC identifies the family that the individual was born into. Some individuals also have one or more lines with FAMS. FAMS means that the individual is the father or mother of that family unit. That means an individual can have multiple FAMS lines, but only one FAMC line.

Since I've figured out how all this works, maybe at some point I'll make a decent Linux family tree program, but it would have to use Mono since I'm best at C#. I guess that's ok though, now that Mono has come as far as it has. It does almost everything the .NET environment on Windows does at this point. There are already some genealogy programs for Linux, but none of them are anywhere near as professionally done as Family Tree Maker. The only one that comes close is Gramps, and I've used it and didn't like the layout of the GUI at all. If I do make a program like that, then we just need AnyDVD and Netflix Watch Now on Linux, and I'll be all set to delete my Windows virtual machine. Lots of people say those things won't happen, but 5 years ago lots of people said we wouldn't be doing all of the things we are doing on Linux today. Anyway, it's coming along well now, and thanks for all the suggestions that everyone gave!
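
The level scheme described in this post (0 for a new record, 1 for an event or fact, 2 for details of that event) could be sketched in bash roughly as follows. The function name `describe_line` and all of the sample data are invented for illustration, and the real file may contain tags and layouts this sketch does not handle.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a line by its leading level number.
describe_line() {
    local line=${1%$'\r'}        # drop a trailing carriage return, if any
    local level=${line%% *}      # text before the first space
    local rest=${line#* }        # text after the first space
    case $level in
        0) echo "new record: $rest" ;;
        1) echo "  event or fact: $rest" ;;
        2) echo "    detail: $rest" ;;
        *) : ;;                  # level 3 and anything else is ignored
    esac
}

# Invented sample data in the general shape described above.
while IFS= read -r line; do
    describe_line "$line"
done <<'EOF'
0 @I1039@ INDI
1 BIRT
2 DATE 1 JAN 1900
2 PLAC Springfield
1 FAMC @F12@
1 FAMS @F20@
3 NOTE ignored
EOF
```

Because each level 2 line belongs to the most recent level 1 line, a real script would probably also remember the last level 0 and level 1 values it saw; that bookkeeping is left out here to keep the sketch short.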

catkin · 05-05-2010, 04:40 AM

Glad you found the replies useful, and thank you for sharing what the code is for.

David the H. · 05-05-2010, 07:24 AM

Yes, thank you for the explanation. I thought you might be planning ahead for further additions.

I've been trying to figure out exactly what this section is supposed to do, and how to simplify it. As best as I can tell, you're trying to extract the number from the line, correct?

Code:

lineoftext=`echo $lineoftext | sed "s/0\ \@I// g"`;
lineoftext=`echo $lineoftext | sed "s/\@\ INDI\r// g"`;
lineoftext="$(echo $lineoftext | sed 's/0*//')";
individual=$lineoftext;

First of all, sed can apply multiple expressions at once using the -e option.

Code:

lineoftext=$(echo "$lineoftext"|sed -e 's/0\ \@I//' -e 's/\@\ INDI\r//' -e 's/0*//')

Note that the "g" flag is unnecessary here, since each of those patterns should occur at most once per line, and the third expression appears to be superfluous unless you want to strip off all leading zeroes.

A simpler version would be something like this:

Code:

lineoftext=$(echo "$lineoftext"|sed -r 's/.*@I([0-9]+)@.*/\1/')
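
To try the single-expression version on its own, something like the following should work. The wrapper name `extract_id_sed` and the sample line are made up for illustration; `-E` is used here as the more portable spelling of GNU sed's `-r`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the single-expression sed call.
extract_id_sed() {
    printf '%s\n' "$1" | sed -E 's/.*@I([0-9]+)@.*/\1/'
}

sample=$'0 @I1039@ INDI\r'     # made-up line with a DOS carriage return
extract_id_sed "$sample"       # prints 1039
```

The trailing `.*` also swallows the carriage return, so no separate step is needed to remove it.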

But I've figured out how to strip everything down to just the number using only bash's built-in parameter expansion. Using built-ins over external commands usually improves efficiency, since no extra process has to be started.

Taking your sample text above, and converting the file to DOS line endings, I did this:

Code:

$ line=$(head -n1 file.txt)  #read the first line from the dos-encoded file.
$ cat -v <<<$line            #displays non-printing characters.
0 @I1039@ INDI^M

$ line="${line:0:$((${#line}-1))}"   #strip the last character (the cr) from the line.  This is the tricky part, and actually unnecessary, since you'll be stripping off the ending below anyway. ;)
$ cat -v <<<$line
0 @I1039@ INDI

$ line=${line#*@I}   #strip everything up to and including @I.
$ line=${line%%@*}   #strip everything from the @ to the end.
$ cat -v <<<$line
1039
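
The two expansions from the session above can be wrapped into a small function, sketched here under the assumption that every input line looks like the sample. The name `extract_id` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the number out of a line like "0 @I1039@ INDI".
extract_id() {
    local line=$1
    line=${line#*@I}    # remove everything up to and including the first "@I"
    line=${line%%@*}    # remove everything from the first remaining "@" onward
    printf '%s\n' "$line"
}

extract_id $'0 @I1039@ INDI\r'   # prints 1039
```

The carriage return disappears along with the rest of the tail, which is why stripping it separately is not needed for this particular line.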

If you can guarantee that the line will always have only the pattern above, and the desired part is always a number, you can make it even easier.

Code:

line=${line//[^0-9]}
line=${line:1:${#line}}  #or even just "line=${line#0}"
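
A small sketch of why that guarantee matters: the digits-only approach keeps every digit on the line, so it only gives the right answer for lines shaped like the sample. The function name `digits_after_level` and the second test line are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the digits, then drop the leading level digit.
digits_after_level() {
    local line=${1//[^0-9]/}   # delete every non-digit character
    printf '%s\n' "${line:1}"  # skip the first digit (the level number)
}

digits_after_level $'0 @I1039@ INDI\r'   # prints 1039
digits_after_level '2 DATE 1 JAN 1900'   # prints 11900, which is not an ID
```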

petrus4 · 05-05-2010, 08:35 AM

I don't consider the data format particularly well designed, myself; although it is workable.

If there's still interest, I will post my own solution to this problem, although it will likely be reasonably long, and focus initially on cleaning up the data format, so I've got something better to work with. I won't be using arrays, though.

grail · 05-05-2010, 09:02 AM

@OP - just thought I would mention that a good rule of thumb when I deal with M$ files is to run them through dos2unix to get rid of any nasties.
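
On a machine where dos2unix is not installed, a rough stand-in is to delete the carriage returns with `tr`. This is only a sketch: unlike dos2unix, it removes every carriage return in the input, not just those in front of a newline. The function name `strip_cr` and the sample input are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy stdin to stdout with every carriage return removed.
strip_cr() {
    tr -d '\r'
}

# Made-up DOS-style input; without strip_cr, cat -v would show ^M at each line end.
printf '0 @I1039@ INDI\r\n1 BIRT\r\n' | strip_cr | cat -v
```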