[SOLVED] Bash script to parse the Windows version written inside the license.rtf file

marietto · 09-19-2010, 07:57 AM

Hello,

Since I'm a total newbie at bash scripting, can someone help me write a bash script to parse the version of Windows written inside the license.rtf file? Thanks.

It starts with :

MICROSOFT SOFTWARE LICENSE TERMS

WINDOWS 7 ULTIMATE

These license terms are an agreement between Microsoft Corporation...

So I need to grab the version of Windows, which in this case is:

WINDOWS 7 ULTIMATE

Thanks in advance.

David the H. · 09-19-2010, 08:04 AM

So, what you need is simply the second line? That should be a piece of cake using sed or a similar tool. There are only two difficulties I see that may need to be considered first.

1. Windows text files use a different line ending than unix files. The file would need to be converted before it could be properly parsed. This is easy to do.

2. Rich text is a kind of markup language, similar to html. What you see in the display is not exactly the text the file itself contains. The parsing would have to take this into account.
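The first point can be checked before any parsing by counting the lines that end in a carriage return. This is only a sketch: the sample file is built on the fly with printf, and the variable names (sample, cr, crlf_lines) are made up for the illustration.

```shell
#!/bin/bash
# Hypothetical sample: two lines with Windows (CRLF) line endings.
sample=$(mktemp)
printf 'MICROSOFT SOFTWARE LICENSE TERMS\r\nWINDOWS 7 ULTIMATE\r\n' > "$sample"

# Store a literal carriage return; command substitution only strips newlines.
cr=$(printf '\r')

# Count the lines whose last character is a CR.
crlf_lines=$(grep -c "${cr}\$" "$sample")
total_lines=$(wc -l < "$sample")

if [ "$crlf_lines" -gt 0 ]; then
    echo "$crlf_lines of $total_lines lines end in CR: DOS/Windows format"
else
    echo "no CR line endings found: unix format"
fi
rm -f "$sample"
```

If every line is reported as ending in CR, the file is in DOS format and the carriage returns have to be accounted for in whatever does the parsing.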

Can we see the contents of the file as displayed in a regular text editor?

marietto · 09-19-2010, 08:18 AM

I need to grab the string "Windows 7 ULTIMATE" or whatever version it is. It could be

"Windows 7 STARTER"
"Windows 7 HOME BASIC"
"Windows 7 HOME PREMIUM"
"Windows 7 PROFESSIONAL"
"Windows 7 ENTERPRISE"
"Windows 7 ULTIMATE"

or

"Windows Vista HOME BASIC"
"Windows Vista HOME PREMIUM"
"Windows Vista ULTIMATE"
"Windows Vista BUSINESS"
"Windows Vista ENTERPRISE"

I took a screenshot of the license.rtf file:

http://www.flickr.com/photos/2668797...n/photostream/

David the H. · 09-19-2010, 08:23 AM

Ok, that helps. Although it would be more convenient if you could copy the actual text here instead of making me work through a screenshot. We'd only need the text up to the line in question, and maybe one or two lines after it.

I see that the line you want is the third line. I'll work on that for now. But now the question becomes, does every version of the file conform to the same format? How flexible does the extraction parsing need to be?

marietto · 09-19-2010, 08:29 AM

These are the first lines of the file :

{\rtf1\ansi\ansicpg1252\deff0\deflang1033\deflangfe2052\deftab360{\fonttbl{\f0\fswiss\fprq2\fcharset 0 Tahoma;}}
{\*\generator Msftedit 5.41.21.2508;}\viewkind4\uc1\pard\nowidctlpar\sb120\sa120\b\f0\fs20 MICROSOFT SOFTWARE LICENSE TERMS\par
\pard\brdrb\brdrs\brdrw10\brsp20 \nowidctlpar\sb120\sa120 WINDOWS 7 ULTIMATE\par
\pard\nowidctlpar\sb120\sa120\b0 These license terms are an agreement between Microsoft Corporation (or based on where you live, one of its affiliates) and $
\pard\nowidctlpar\sb120\sa120\tx0\'b7\tab updates,\par
\'b7\tab supplements,\par
\'b7\tab Internet-based services, and\par
\'b7\tab support services\par
\pard\nowidctlpar\sb120\sa120 for this software, unless other terms accompany those items. If so, those terms apply.\par
\b By using the software, you accept these terms. If you do not accept them, do not use the software. Instead, return it to the retailer for a refund or cre$
\b As described below, using the software also operates as your consent to the transmission of certain computer information during activation, validation an$
\pard\brdrt\brdrs\brdrw10\brsp20 \nowidctlpar\sb120\sa120 If you comply with these license terms, you have the rights below for each license you acquire.\par
\pard\nowidctlpar\sb120\sa120\tx360 1.\tab OVERVIEW.\par

marietto · 09-19-2010, 08:44 AM

Whatever the version, the format is the same: it is always on the third line, regardless of its length.

David the H. · 09-19-2010, 08:59 AM

Thank you. But please use [code][/code] tags around preformatted text in the future.

If you assume the text you want will always be on the third line, will always be in approximately the same format, and that it will always begin with the word WINDOWS (that's a lot of assumptions!), then this should do it.

Code:

sed -rn -e '1,3 s/.$//' -e '3 s/^.+(WINDOWS[^\]+)\\par$/\1/p' license.rtf

The first -e expression converts the line endings on the first three lines to unix format (it deletes the last character of each line, which is the carriage return in a DOS-format file), then the second -e extracts everything from the word WINDOWS to the final \par, if it exists, from the third line.

If you need it to be more flexible than that, then you'll have to show us all the possible variations you may encounter.

And here are a few useful sed references:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt

Edit: After thinking about it a bit, I can make it even shorter. By taking the dos-mode carriage return into account inside the main expression, I can eliminate the first one entirely. This should work no matter which format it's in:

Code:

sed -rn '3 s/^.+(WINDOWS[^\]+)\\par\r?$/\1/p' license.rtf

I'm still trying to figure out if there'd be an easy way to extract the text if it didn't start with the word "WINDOWS".
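To try the one-expression version without the real file at hand, something like the following can be used. It is a sketch under assumptions: GNU sed is required (for -r and \r), and the three sample lines are a shortened, hypothetical imitation of the start of the file quoted earlier, written with CRLF endings.

```shell
#!/bin/bash
# Hypothetical three-line sample in DOS format; %s prints each argument
# literally, so the backslashes in the RTF control words are preserved.
sample=$(mktemp)
printf '%s\r\n' \
    '{\rtf1\ansi\deff0{\fonttbl{\f0\fswiss Tahoma;}}' \
    '{\*\generator Msftedit 5.41.21.2508;}\viewkind4\uc1\pard\b\f0\fs20 MICROSOFT SOFTWARE LICENSE TERMS\par' \
    '\pard\brdrb\brdrs\brdrw10\brsp20 \nowidctlpar\sb120\sa120 WINDOWS 7 ULTIMATE\par' \
    > "$sample"

# On line 3 only: keep the text from WINDOWS up to the closing \par.
# The optional \r lets the same expression match DOS and unix endings.
edition=$(sed -rn '3 s/^.+(WINDOWS[^\]+)\\par\r?$/\1/p' "$sample")
echo "$edition"
rm -f "$sample"
```

Capturing the result in a variable, as done here with edition, is probably the form a script would want anyway.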

marietto · 09-19-2010, 10:09 AM

It works now, thanks.

David the H. · 09-19-2010, 12:13 PM

Just to follow up, I decided to try again with a slightly different tack, and came up with an even simpler version.

Code:

sed -rn '3 s/\\[^ ]+ ?|\r$//gp' license.rtf

This one simply removes from the line any string that starts with a \, along with one trailing space if present, as well as any DOS carriage return. It also has the advantage that it can be used on any line that follows this pattern.
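As a rough illustration of using the same substitution on more than one line, here is a sketch that drops the line address so every line is processed. GNU sed is assumed, and the two brace-free sample lines are hypothetical ones in the style of the file quoted earlier.

```shell
#!/bin/bash
# Hypothetical brace-free sample lines with CRLF endings.
sample=$(mktemp)
printf '%s\r\n' \
    '\pard\brdrb\brdrs\brdrw10\brsp20 \nowidctlpar\sb120\sa120 WINDOWS 7 ULTIMATE\par' \
    '\pard\nowidctlpar\sb120\sa120 for this software, unless other terms accompany those items.\par' \
    > "$sample"

# No line address and no -n/p: the substitution is tried on every line
# and every line is printed. Each run of control words is deleted,
# together with one following space and any trailing carriage return.
stripped=$(sed -r 's/\\[^ ]+ ?|\r$//g' "$sample")
echo "$stripped"
rm -f "$sample"
```

Note that this also swallows things like \tab and \'b7, so list markers and tab separators in the original text are lost rather than converted.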

There's no easy way to work with lines that have {} braces, however, because they can be nested, and that's really hard to deal with in sed. For more complex operations like that, you may want to look at something like unrtf instead. I just tried it and it appears to work well, although you have to compensate for the header it adds when filtering out the line(s) you want.

Code:

unrtf --nopict --text e_file.txt 2>/dev/null | sed -n 9p

The 2>/dev/null is only there to get rid of the program info that's displayed, but since that appears on stderr, it shouldn't affect scripting much anyway. The program could really use a non-verbose option of some kind.

marietto · 09-19-2010, 02:56 PM

This also works:

mario@mario-desktop:/media/FREEDOS/sh$ cat license.rtf | grep "WINDOWS" | cut -d'\' -f9 | cut -d' ' -f "2 3 4"

WINDOWS 7 ULTIMATE

Now I have an easier problem:

I'm trying to parse the version of Windows XP written inside eula.txt. This is the beginning:

Microsoft Windows XP Home Edition

END-USER LICENSE AGREEMENT

IMPORTANT-READ CAREFULLY: This End-User

The text I need to grab is always on the first line. In this case it is:

Microsoft Windows XP Home Edition

I did :

ver_windows=$(cat /mnt/sda1/Windows/system32/eula.txt | grep "Microsoft Windows")
echo $ver_windows

Microsoft Windows XP Home Edition

if [ "$ver_windows" = "Microsoft Windows XP Home Edition" ]; then ver_windows_min="xp"
echo home
fi

Output: nothing.

I think that $ver_windows is not equal to "Microsoft Windows XP Home Edition". Why?

David the H. · 09-19-2010, 03:50 PM

Well, there are probably a hundred different ways to extract the string using different tools. In general though, it's more efficient to use a single program than a chain of piped commands. sed can do everything grep and cut can do, and more flexibly, so I recommend it.

As for your new request, first of all, there's no need to use cat with grep, since it can read the file directly. The majority of text-based tools can do this.

Code:

grep "Microsoft Windows" eula.txt

But I'll bet you anything that you're running across the line-ending problem again. You see, unix uses LF (line feed) for its line endings, while DOS uses CRLF (carriage return + line feed). When you grepped the file, you most likely grabbed the invisible CR along with the text, which is why the test fails. Try confirming it with this:

Code:

echo "$ver_windows" | cat -A

It will probably show you this: Microsoft Windows XP Home Edition^M$. ^M is the CR.
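The effect can be reproduced without the Windows file. In this sketch the grep result is simulated with printf, and the names expected and result are invented for the example; cat -A is assumed to be the GNU coreutils version.

```shell
#!/bin/bash
expected="Microsoft Windows XP Home Edition"

# Simulate what grep returns from a DOS-format file: the text plus a CR.
# Command substitution strips trailing newlines only, so the CR survives.
ver_windows=$(printf 'Microsoft Windows XP Home Edition\r')

if [ "$ver_windows" = "$expected" ]; then
    result="equal"
else
    result="different"
fi
echo "comparison: $result"

# The two strings look the same on screen, but the lengths give it away.
echo "length with CR: ${#ver_windows}, expected length: ${#expected}"

# Quote the variable so the string reaches cat exactly as stored.
printf '%s\n' "$ver_windows" | cat -A
```

The string with the CR is one character longer than the expected one, which is enough to make the = test fail.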

Sed is again the better tool here. With its ability to change text and also target lines based on line number, it can do everything at once.

Code:

ver_windows=$(sed -n '1 s/\r$//p' /mnt/sda1/Windows/system32/eula.txt)
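One caveat with the command above: the p flag prints only when the substitution actually matched, so on a file that is already in unix format it prints nothing. A possible variant that prints the first line either way is sketched below. GNU sed is assumed, and the two temporary files are hypothetical stand-ins for eula.txt.

```shell
#!/bin/bash
# Two hypothetical copies of the same text, one DOS-format, one unix-format.
dos_file=$(mktemp)
unix_file=$(mktemp)
printf 'Microsoft Windows XP Home Edition\r\nEND-USER LICENSE AGREEMENT\r\n' > "$dos_file"
printf 'Microsoft Windows XP Home Edition\nEND-USER LICENSE AGREEMENT\n' > "$unix_file"

# s///p prints only on a successful substitution, so this stays empty
# for the unix-format file.
only_if_cr=$(sed -n '1 s/\r$//p' "$unix_file")

# Strip the CR if present, print line 1 unconditionally, then quit
# so the rest of the file is never read.
from_dos=$(sed -n '1{s/\r$//;p;q}' "$dos_file")
from_unix=$(sed -n '1{s/\r$//;p;q}' "$unix_file")

echo "original command on unix file: '$only_if_cr'"
echo "variant on DOS file:           '$from_dos'"
echo "variant on unix file:          '$from_unix'"
rm -f "$dos_file" "$unix_file"
```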

There are other options, such as using read to grab the first line from the file and then stripping off the invisible carriage return with a parameter substitution. Indeed, this would be the most efficient method, since it works entirely within bash.

Code:

read ver_windows </mnt/sda1/Windows/system32/eula.txt
ver_windows=${ver_windows%$'\r'}
#note that the extquote shell option must be enabled for the above to work.
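A self-contained way to try the read approach is sketched here. The temporary file is a hypothetical stand-in for eula.txt, and IFS= and -r are additions not in the snippet above: they keep read from trimming whitespace or interpreting backslashes, so the line is stored as it appears in the file. The $'\r' quoting is bash-specific.

```shell
#!/bin/bash
# Hypothetical stand-in for eula.txt, written in DOS format.
eula=$(mktemp)
printf 'Microsoft Windows XP Home Edition\r\nEND-USER LICENSE AGREEMENT\r\n' > "$eula"

# Read only the first line; read stops at the LF, leaving the CR attached.
IFS= read -r ver_windows < "$eula"

# Remove one trailing carriage return, if there is one.
ver_windows=${ver_windows%$'\r'}

if [ "$ver_windows" = "Microsoft Windows XP Home Edition" ]; then
    matched="yes"
else
    matched="no"
fi
echo "first line: '$ver_windows' (match: $matched)"
rm -f "$eula"
```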

Or, since you seem to need to use Windows files a lot, consider simply converting those files over to unix mode before using them. There are several tools that can do this, such as tofrodos or flip, or the sed command I used above (just remove the "1" to make it affect the whole file, drop the -n option and the p flag so that every line is written out, and use the -i option to edit in place).
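A whole-file conversion with sed might look like the sketch below. It assumes GNU sed (for -i and \r) and works on a throwaway copy in a temporary directory, which is probably wise for files that live on a Windows partition. Note that the -n option and the p flag are left out here; with them, lines that have no CR would be dropped from the file.

```shell
#!/bin/bash
# Work on a throwaway copy; the directory and content are illustrative.
workdir=$(mktemp -d)
printf 'Microsoft Windows XP Home Edition\r\nEND-USER LICENSE AGREEMENT\r\n' > "$workdir/eula.txt"

# No line address: the CR is stripped from the end of every line.
# -i writes the result back to the file (-i.bak would keep a backup).
sed -i 's/\r$//' "$workdir/eula.txt"

# After the conversion a plain head is enough to get a clean first line.
first_line=$(head -n 1 "$workdir/eula.txt")
if [ "$first_line" = "Microsoft Windows XP Home Edition" ]; then
    converted="yes"
else
    converted="no"
fi
echo "first line: '$first_line' (clean: $converted)"
rm -rf "$workdir"
```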

marietto · 09-19-2010, 05:03 PM

Thanks for your help. I really don't understand.

ver_windows=$(sed -n '1 s/\r$//p' /mnt/sda1/Windows/system32/eula.txt)

--> Microsoft Windows XP Home Edition

echo $ver_windows |cat -A

--> Microsoft Windows XP Home Edition$

if [ "$ver_windows" = "Microsoft Windows XP Home Edition" ]; then ver_windows_min="xp"
echo home
fi

if [ "$ver_windows" = "Microsoft Windows XP Professional Edition" ]; then ver_windows_min="xp"
echo pro
fi

echo $ver_windows_min

---> nothing is displayed.

FIXED: there was a stupid hidden space at the end of the first line.
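For anyone who runs into the same thing: the cat -A check most likely missed the space because the variable was not quoted, so the shell split it into words and echo put them back together without the trailing space. The sketch below simulates that situation with a hard-coded string, and also shows one possible way to trim trailing whitespace before comparing. The names unquoted, quoted and trimmed are invented for the example, and cat -A is assumed to be the GNU coreutils version.

```shell
#!/bin/bash
# Simulated first line of eula.txt: CR already removed, trailing space left.
ver_windows='Microsoft Windows XP Home Edition '

# Unquoted: word splitting discards the trailing space before echo runs.
unquoted=$(echo $ver_windows | cat -A)
# Quoted: the string reaches cat -A unchanged, so the space shows up
# just before the end-of-line marker.
quoted=$(echo "$ver_windows" | cat -A)
echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# The inner expansion leaves only the trailing whitespace; the outer
# one removes exactly that suffix from the original string.
trimmed=${ver_windows%"${ver_windows##*[![:space:]]}"}

if [ "$trimmed" = "Microsoft Windows XP Home Edition" ]; then
    ver_windows_min="xp"
fi
echo "ver_windows_min=$ver_windows_min"
```

Quoting variables in echo and in tests is a good habit in general, since it shows the string exactly as stored.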