Get a list of delimited filenames from a text file (sed?)

Ksearch · 06-29-2009, 09:40 PM

Hi, I'm really new to Bash, so this could sound silly to most of you. I'm trying to get a list of some filenames from a text file. Tried to do this with sed and awk, but couldn't get it to work with my limited knowledge.

This is a sample file content:

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="471.677px" height="126.604px" viewBox="0 0 471.677 126.604" enable-background="new 0 0 471.677 126.604"
xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="471.667" height="125.333"/>
<text transform="matrix(1 0 0 1 0.0098 8.3701)"><tspan x="0" y="0" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf</tspan><tspan x="0" y="12" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf</tspan><tspan x="0" y="24" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</tspan></text>
</svg>
What I would like to get from this sample is a new text file with this exact content:

/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

I thought of telling sed to print every match between 'font-size="10">' and '</tspan>', but the best I got was a file with the whole line containing my field delimiters.

If you could explain each step, that would be great.

There could be more or fewer filenames. These 3 are just an example.

pixellany · 06-29-2009, 09:54 PM

Using your example, the syntax would be simple---ie find all patterns beginning in "/Volumes" and ending in ".pdf"

The Regex would be: "/Volumes.*\.pdf"

So--verify what the criteria should be, and post some sample code. Also, what references (books, tutorials, etc.) are you using?

billymayday · 06-29-2009, 10:04 PM

Or

Code:

sed -e 's/.*font-size="10">\(.*\)<\/tspan>/\1/' your_input_file

where \(.*\) effectively picks up the pattern between "10"> and </tspan>, and replaces the line with it (\1).

Ksearch · 06-29-2009, 11:52 PM

Thanks for such a quick reply!

I've already tried both methods, from billymayday and pixellany. I think I'm getting them both wrong though :b

1) Here is my code for pixellany's solution:
#!/bin/bash
DEBUGGINGDIR=/Volumes/Secondary500/Temp
FILE=$DEBUGGINGDIR/*.svg
PRINTFILE=$DEBUGGINGDIR/10pt.txt

cat $FILE | awk -F '/Volumes.*\.pdf' '{print $2;}' > $PRINTFILE

And this is the output I get from it (the input .svg file content is the initially given example):

</tspan></text>

What am I doing wrong?
(I'm learning from a lot of web pages like http://linux.org.mt/article/terminal, http://www.cs.hmc.edu/tech_docs/qref/sed.html, http://ftp.gnu.org/old-gnu/Manuals/s...ter/sed_3.html, plus Google and the man pages from Apple, since I'm using OS X 10.5. Would you recommend a good one? Thanks.)

2) And this is the code I used for billymayday's solution:

#!/bin/bash
DEBUGGINGDIR=/Volumes/Secondary500/Temp
FILE=$DEBUGGINGDIR/*.svg
PRINTFILE=$DEBUGGINGDIR/10pt.txt

sed -e 's/.*font-size="10">\(.*\)<\/tspan>/\1/g' $FILE > $PRINTFILE

And gave me this output (got only the first filename, tried adding "g" afterwards, but didn't work):

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="362.51px" height="97.437px" viewBox="0 0 362.51 97.437" enable-background="new 0 0 362.51 97.437" xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="362.5" height="96.167"/>
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</text>
</svg>

What should I fix? I've been trying different approaches, but still can't make it :s

billymayday · 06-30-2009, 12:07 AM

Yes, well I guess a bit more testing would have helped, huh?

Code:

 sed -n -e 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p' test1

looks better.

pixellany · 06-30-2009, 05:22 AM

When I suggested the structure of the Regex to be used, I did not mean that you would use it as the field separator in AWK......
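A minimal sketch of why the awk attempt printed only `</tspan></text>`. It assumes a shortened, made-up sample line (the `/Volumes/Temp/...` paths are placeholders, not the real file) and writes the dot as `[.]` to sidestep escaping differences between awk versions:

```shell
# Hypothetical one-line sample modelled on the <text> line in the question.
line='<text><tspan font-size="10">/Volumes/Temp/a 1.pdf</tspan><tspan font-size="10">/Volumes/Temp/b.pdf</tspan></text>'

# Used as a field separator, the regex is the part awk throws away.
# It matches greedily from the first "/Volumes" to the last ".pdf",
# so $1 is the text before the match and $2 is the text after it.
before=$(printf '%s\n' "$line" | awk -F '/Volumes.*[.]pdf' '{ print $1 }')
after=$(printf '%s\n' "$line" | awk -F '/Volumes.*[.]pdf' '{ print $2 }')

printf 'before: %s\nafter:  %s\n' "$before" "$after"
```

On lines that contain no match there is only one field, so `$2` is empty and awk prints a blank line for each of them.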

Here is just one way to do this in SED:

Code:

sed -n 's/.*\(word\).*/\1/p' filename

Translation:
suppress printing unless stated.
for any line containing "word", replace the entire line with "word", then print.

Will only pick up one instance of "word" per line.....

How about "grep -o"?
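The "one instance per line" limit can be sketched as follows, again on a shortened, made-up sample line (the paths are placeholders). The second command is only one possible workaround, assuming no filename contains a `<`: it puts every tag on its own line with `tr` before sed runs.

```shell
# Hypothetical one-line sample with two names on the same line.
line='<text><tspan font-size="10">/Volumes/Temp/a 1.pdf</tspan><tspan font-size="10">/Volumes/Temp/b.pdf</tspan></text>'

# The leading .* is greedy, so it swallows everything up to the LAST
# font-size="10"> on the line; only the final name is captured in \1.
last_only=$(printf '%s\n' "$line" |
  sed -n 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p')

# Turn every "<" into a newline so each tag starts its own line,
# then capture whatever follows font-size="10"> on each of those lines.
all_names=$(printf '%s\n' "$line" | tr '<' '\n' |
  sed -n 's/.*font-size="10">\(.*\)/\1/p')

printf 'greedy sed:\n%s\n\nafter tr:\n%s\n' "$last_only" "$all_names"
```

Splitting with `tr` avoids relying on `\n` in the sed replacement text, which not every sed accepts.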

ghostdog74 · 06-30-2009, 05:47 AM

minimal regular expression.

Code:

awk 'BEGIN{RS="</tspan>";FS=">"}{ print $NF}' file

output

Code:

# more file
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948) -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="471.677px" height="126.604px" viewBox="0 0 471.677 126.604" enable-background="new 0 0 471.677 126.604"
xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="471.667" height="125.333"/>
<text transform="matrix(1 0 0 1 0.0098 8.3701)"><tspan x="0" y="0" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/T
emp/Untitled-2_Layer 1 copy 2.pdf</tspan><tspan x="0" y="12" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Un
titled-2_Layer 1 copy.pdf</tspan><tspan x="0" y="24" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2
_Layer 1.pdf</tspan></text>
</svg>

# ./testnew.sh
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

ghostdog74 · 06-30-2009, 05:48 AM

Quote:

Originally Posted by billymayday

Yes, well I guess a bit more testing would have helped, huh?

Code:

 sed -n -e 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p' test1

looks better.

unless all of them are on single line (which i doubt), the above only can get 1 result.

billymayday · 06-30-2009, 05:57 AM

Quote:

Originally Posted by ghostdog74

unless all of them are on single line (which i doubt), the above only can get 1 result.

Don't you mean unless they're all on different lines? If they're on the same line, you'll only get one result.

I didn't spend that long on the data to be honest.

pixellany · 06-30-2009, 06:26 AM

If I can get the filenames to not have line breaks in them, then this works:

grep -o '/Volumes.*pdf' file

ghostdog74 · 06-30-2009, 06:34 AM

Quote:

Originally Posted by billymayday

Don't you mean unless they're all on different lines? If they're on the same line, you'll only get one result.

I didn't spend that long on the data to be honest.

Yes, pardon my English. If they are all on the same line, then the sed command, which has no non-greedy matching, will return only 1 result.

syg00 · 06-30-2009, 06:45 AM

Quote:

Originally Posted by pixellany

If I can get the filenames to not have line breaks in them, then this works:

grep -o '/Volumes.*pdf' file

Not if there are 2 or more on the one line - note ghostdog74's comment on greediosity. Try

Code:

grep -Eo "/Volumes[^.]*.pdf" file
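A small sketch of the difference between the greedy pattern and a bounded one, on a made-up sample line (the paths are placeholders). The `[^<]*` variant with an escaped dot is not from the thread; it is a suggestion that assumes names never contain `<`, and it also tolerates extra dots in a filename, where `[^.]*` stops at the first dot.

```shell
# Hypothetical one-line sample with two names on the same line.
line='<text><tspan font-size="10">/Volumes/Temp/a 1.pdf</tspan><tspan font-size="10">/Volumes/Temp/b.pdf</tspan></text>'

# Greedy .* runs from the first "/Volumes" to the last "pdf":
# one long match that still contains the tags in between.
greedy=$(printf '%s\n' "$line" | grep -o '/Volumes.*pdf')

# A negated character class cannot cross a "<", so each name is
# matched separately; \. matches a literal dot only.
per_name=$(printf '%s\n' "$line" | grep -Eo '/Volumes[^<]*\.pdf')

printf 'greedy:\n%s\n\nper name:\n%s\n' "$greedy" "$per_name"
```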

pixellany · 06-30-2009, 06:51 AM

Touche ( I mean: TOO-SHAY....How do I type accented letters here?)

greediosity???? Hmmmm

Now---make it work if there are line breaks in the desired matched patterns........
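One possible approach, sketched on made-up input where a hard line break falls inside the second name: delete the newlines with `tr` before matching. This assumes the line breaks are not a meaningful part of the text. If the breaks in the listing above only come from the pager wrapping long lines, the file itself needs no joining.

```shell
# Hypothetical two-line sample: the second path is broken after "/Volumes/Te".
broken=$(printf '%s\n' \
  '<tspan font-size="10">/Volumes/Temp/a 1.pdf</tspan><tspan font-size="10">/Volumes/Te' \
  'mp/b.pdf</tspan>')

# grep works line by line, so join everything into one line first,
# then match each name with a pattern that cannot cross a "<".
joined=$(printf '%s\n' "$broken" | tr -d '\n' | grep -Eo '/Volumes[^<]*\.pdf')

printf '%s\n' "$joined"
```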

Ksearch · 06-30-2009, 11:57 AM

Thanks a lot, this line from Pixellany and syg00 output exactly what I was looking for. So I'm gonna learn grep better!

grep -Eo "/Volumes[^.]*.pdf" file

Ghostdog74, the awk line worked as well, but it printed a lot of empty lines before and between the filenames. Do you know why? How can that be avoided? (For educational purposes, hehe. I'm going to need awk and sed very soon for a couple of scripts.)
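A likely cause of the empty lines, which the thread does not confirm: POSIX leaves a multi-character RS unspecified. gawk and mawk treat it as a regular expression, but some awks (traditionally the BSD one) use only its first character. RS then becomes "<", every tag becomes its own record, and $NF is empty for most of them. A hedged sketch that makes this explicit and prints only the fields that end in .pdf, on a made-up sample line with placeholder paths:

```shell
# Hypothetical one-line sample modelled on the <text> line in the question.
line='<text><tspan font-size="10">/Volumes/Temp/a 1.pdf</tspan><tspan font-size="10">/Volumes/Temp/b.pdf</tspan></text>'

# RS="<" starts a new record at every tag; FS=">" makes the last field
# the text that follows the tag. The pattern before the action skips
# records whose last field is empty or does not end in ".pdf".
names=$(printf '%s\n' "$line" |
  awk 'BEGIN { RS = "<"; FS = ">" } $NF ~ /[.]pdf$/ { print $NF }')

printf '%s\n' "$names"
```

A single-character RS behaves the same in every awk, so this form should not depend on which awk is installed.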

billymayday · 06-30-2009, 04:51 PM

Try http://www.ibm.com/developerworks/li...ry/l-sed1.html as a good sed primer