LinuxQuestions.org


Ksearch 06-29-2009 10:40 PM

Get a list of delimited filenames from a text file (sed?)
 
Hi, I'm really new to Bash, so this could sound silly to most of you. I'm trying to get a list of some filenames from a text file. Tried to do this with sed and awk, but couldn't get it to work with my limited knowledge.

This is a sample file content:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948) -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="471.677px" height="126.604px" viewBox="0 0 471.677 126.604" enable-background="new 0 0 471.677 126.604"
xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="471.667" height="125.333"/>
<text transform="matrix(1 0 0 1 0.0098 8.3701)"><tspan x="0" y="0" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf</tspan><tspan x="0" y="12" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf</tspan><tspan x="0" y="24" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</tspan></text>
</svg>
What I would like to get from this sample is a new text file with this exact content:

/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

I thought of telling sed to print all the matching entries between 'font-size="10">' and '</tspan>', but the best I got was a file with the whole line containing my field delimiters.

If you could explain each step, that would be great.

There could be more or fewer filenames. These 3 are just an example.

pixellany 06-29-2009 10:54 PM

Using your example, the syntax would be simple---ie find all patterns beginning in "/Volumes" and ending in ".pdf"

The Regex would be: "/Volumes.*\.pdf"

So--verify what the criteria should be, and post some sample code. Also, what references (books, tutorials, etc.) are you using?

billymayday 06-29-2009 11:04 PM

Or
Code:

sed -e 's/.*font-size="10">\(.*\)<\/tspan>/\1/' your_input_file
where \(.*\) effectively picks up the pattern between "10"> and </tspan>, and replaces the line with it (\1).

Ksearch 06-30-2009 12:52 AM

Thanks for such a quick reply!

I've already tried both methods, from billymayday and pixellany. I think I'm getting them both wrong though :b


1) Here is my code for pixellany solution:
#!/bin/bash
DEBUGGINGDIR=/Volumes/Secondary500/Temp
FILE=$DEBUGGINGDIR/*.svg
PRINTFILE=$DEBUGGINGDIR/10pt.txt

cat $FILE | awk -F '/Volumes.*\.pdf' '{print $2;}' > $PRINTFILE

And this is the output I get from it (the input .svg file content is the initially given example):

</tspan></text>

What am I doing wrong?
(I'm learning from a lot of web pages, like http://linux.org.mt/article/terminal, http://www.cs.hmc.edu/tech_docs/qref/sed.html and http://ftp.gnu.org/old-gnu/Manuals/s...ter/sed_3.html, plus Google and Apple's man pages, since I'm using OS X 10.5. Could you recommend a good one? Thanks.)


2) And this is the code I used for billymayday solution:

#!/bin/bash
DEBUGGINGDIR=/Volumes/Secondary500/Temp
FILE=$DEBUGGINGDIR/*.svg
PRINTFILE=$DEBUGGINGDIR/10pt.txt

sed -e 's/.*font-size="10">\(.*\)<\/tspan>/\1/g' $FILE > $PRINTFILE


And it gave me this output (I got only the first filename; I tried adding "g" afterwards, but it didn't work):

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948) -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="362.51px" height="97.437px" viewBox="0 0 362.51 97.437" enable-background="new 0 0 362.51 97.437" xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="362.5" height="96.167"/>
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</text>
</svg>

What should I fix? I've been trying different approaches, but still can't make it :s

billymayday 06-30-2009 01:07 AM

Yes, well I guess a bit more testing would have helped, huh?

Code:

sed -n -e 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p' test1
looks better.
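
A small sketch of what the added -n and p change, using a made-up three-line sample rather than the real SVG. Without -n, sed prints every line whether or not the substitution matched, which is why the XML header lines showed up in the earlier output. With -n and the p flag, only lines where the substitution succeeded are printed.

```shell
#!/bin/sh
# Hypothetical three-line sample; only the middle line contains a filename.
sample=$(printf '%s\n' '<svg>' '<tspan font-size="10">/Volumes/T/a.pdf</tspan>' '</svg>')

# Without -n: non-matching lines pass through unchanged.
all_lines=$(printf '%s\n' "$sample" |
  sed -e 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/')

# With -n and the p flag: only lines where the substitution happened are printed.
only_hits=$(printf '%s\n' "$sample" |
  sed -n -e 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p')

printf 'without -n:\n%s\n\n' "$all_lines"
printf 'with -n ... p:\n%s\n' "$only_hits"
```

The trailing .* in the pattern also matters: it swallows whatever follows </tspan> on the line, so leftovers such as </text> are not printed.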

pixellany 06-30-2009 06:22 AM

When I suggested the structure of the Regex to be used, I did not mean that you would use it as the field separator in AWK......
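
To illustrate the difference, here is a minimal sketch on a made-up one-line input. When the regex is given to -F, awk treats the matched text as the separator and throws it away, so $2 is whatever follows the match. That is consistent with the </tspan></text> output above. Printing the match itself can be done with awk's match() function and the RSTART and RLENGTH variables. The dot is written as [.] here only to avoid escape-sequence warnings that some awks give for \. inside -F.

```shell
#!/bin/sh
# Hypothetical input line with text before and after the path.
line='before /Volumes/T/x.pdf after'

# Regex used as field separator: the path is the separator, $2 is the remainder.
as_separator=$(printf '%s\n' "$line" | awk -F '/Volumes.*[.]pdf' '{ print $2 }')

# Regex used with match(): print the matched text itself.
as_match=$(printf '%s\n' "$line" |
  awk 'match($0, /\/Volumes.*[.]pdf/) { print substr($0, RSTART, RLENGTH) }')

printf 'as separator: [%s]\n' "$as_separator"
printf 'as match:     [%s]\n' "$as_match"
```

Note that match() finds only the first (and longest) match on each line, so on its own it does not handle several filenames on one line.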

Here is just one way to do this in SED:
Code:

sed -n 's/.*\(word\).*/\1/p' filename
Translation:
suppress printing unless stated.
for any line containing "word", replace the entire line with "word", then print.

Will only pick up one instance of "word" per line.....

How about "grep -o"?
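
A quick comparison of the two, as a sketch on a made-up two-line sample. The sed command emits at most one "word" per input line, while grep -o emits every match on its own line.

```shell
#!/bin/sh
# Hypothetical sample: the first line contains "word" twice, the second once.
sample=$(printf '%s\n' 'a word and another word' 'just one word here')

# sed replaces each matching line with a single "word": 2 lines out.
sed_out=$(printf '%s\n' "$sample" | sed -n 's/.*\(word\).*/\1/p')

# grep -o prints each match separately: 3 lines out.
grep_out=$(printf '%s\n' "$sample" | grep -o 'word')

printf 'sed:\n%s\n\n' "$sed_out"
printf 'grep -o:\n%s\n' "$grep_out"
```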

ghostdog74 06-30-2009 06:47 AM

minimal regular expression.
Code:

awk 'BEGIN{RS="</tspan>";FS=">"}{ print $NF}' file
output
Code:

# more file
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948) -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="471.677px" height="126.604px" viewBox="0 0 471.677 126.604" enable-background="new 0 0 471.677 126.604"
xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="471.667" height="125.333"/>
<text transform="matrix(1 0 0 1 0.0098 8.3701)"><tspan x="0" y="0" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/T
emp/Untitled-2_Layer 1 copy 2.pdf</tspan><tspan x="0" y="12" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Un
titled-2_Layer 1 copy.pdf</tspan><tspan x="0" y="24" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2
_Layer 1.pdf</tspan></text>
</svg>

# ./testnew.sh
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf


ghostdog74 06-30-2009 06:48 AM

Quote:

Originally Posted by billymayday (Post 3590997)
Yes, well I guess a bit more testing would have helped, huh?

Code:

sed -n -e 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p' test1
looks better.

unless all of them are on single line (which i doubt), the above only can get 1 result.

billymayday 06-30-2009 06:57 AM

Quote:

Originally Posted by ghostdog74 (Post 3591274)
unless all of them are on single line (which i doubt), the above only can get 1 result.

Don't you mean unless they're all on different lines? If they're on the same line, you'll only get one result.

I didn't spend that long on the data to be honest.

pixellany 06-30-2009 07:26 AM

If I can get the filenames to not have line breaks in them, then this works:

grep -o '/Volumes.*pdf' file

ghostdog74 06-30-2009 07:34 AM

Quote:

Originally Posted by billymayday (Post 3591281)
Don't you mean unless they're all on different lines? If they're on the same line, you'll only get one result.

I didn't spend that long on the data to be honest.

Yes, pardon my English. If they are all on the same line, then sed, which has no non-greedy matching, will return only 1 result.
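
One possible way around the missing non-greedy match, sketched here on a shortened, made-up version of the long <text> line: first put a newline after every </tspan>, so each line holds a single filename, then run the same sed expression. This is only one approach and is not taken from the posts above.

```shell
#!/bin/sh
# Shortened stand-in for the long <text> line in the sample SVG (hypothetical data).
line='<text><tspan x="0" font-size="10">/Volumes/T/a 1.pdf</tspan><tspan x="0" font-size="10">/Volumes/T/b.pdf</tspan></text>'

# Step 1: add a newline after every </tspan>. The backslash followed by a real
#         newline in the replacement is understood by both GNU and BSD sed.
# Step 2: the greedy pattern is now safe, since each line has at most one filename.
result=$(printf '%s\n' "$line" |
  sed 's/<\/tspan>/&\
/g' |
  sed -n 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p')

printf '%s\n' "$result"
```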

syg00 06-30-2009 07:45 AM

Quote:

Originally Posted by pixellany (Post 3591317)
If I can get the filenames to not have line breaks in them, then this works:

grep -o '/Volumes.*pdf' file

Not if there are 2 or more on the one line - note ghostdog74's comment on greediosity :p. Try
Code:

grep -Eo '/Volumes[^.]*\.pdf' file
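
A sketch of why the bracket expression helps, on a made-up line with two filenames. [^.]* cannot cross a dot, so each match has to stop at the first ".pdf" it reaches. In this sketch the dot before pdf is escaped so that it matches only a literal dot. The last variant, with [^<]*, is an assumption added for illustration: it relies on filenames never containing "<", and in exchange it tolerates extra dots in the name.

```shell
#!/bin/sh
# Hypothetical line: two filenames, the second with extra dots in its name.
line='<tspan>/Volumes/T/a.pdf</tspan><tspan>/Volumes/T/v1.2.pdf</tspan>'

# Greedy .* runs from the first /Volumes to the last pdf: one long match.
greedy=$(printf '%s\n' "$line" | grep -o '/Volumes.*pdf')

# [^.]* stops at the first dot: finds a.pdf, but misses v1.2.pdf.
no_dot=$(printf '%s\n' "$line" | grep -Eo '/Volumes[^.]*\.pdf')

# [^<]* stops at the next tag instead: finds both names.
no_lt=$(printf '%s\n' "$line" | grep -Eo '/Volumes[^<]*\.pdf')

printf 'greedy:\n%s\n\n' "$greedy"
printf '[^.]*:\n%s\n\n' "$no_dot"
printf '[^<]*:\n%s\n' "$no_lt"
```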

pixellany 06-30-2009 07:51 AM

Touche ( I mean: TOO-SHAY....How do I type accented letters here?)

greediosity???? Hmmmm

Now---make it work if there are line breaks in the desired matched patterns........;)

Ksearch 06-30-2009 12:57 PM

Thanks a lot, this line from Pixellany and syg00 output exactly what I was looking for. So I'm gonna learn grep better!

grep -Eo '/Volumes[^.]*\.pdf' file

Ghostdog74, the awk line worked as well, but it printed a lot of empty lines before and between the filenames. Do you know why? How can that be avoided? (For educational purposes, hehe. I'm going to need awk and sed very soon for a couple of scripts.)
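
A possible explanation, offered as an assumption since nothing in the thread confirms it: POSIX leaves the behaviour of a multi-character RS unspecified, and some awk implementations use only its first character. With RS effectively "<", every tag starts a new record, and $NF is empty for most of them. Even with an awk that honours the full RS, the text after the last </tspan> still forms a record that prints as blank output. The sketch below avoids RS altogether and uses split(), which behaves the same across awks. The sample data and variable names are made up.

```shell
#!/bin/sh
# Hypothetical three-line sample modelled on the SVG in the first post.
sample=$(printf '%s\n' \
  '<?xml version="1.0"?>' \
  '<text><tspan font-size="10">/Volumes/T/a 1.pdf</tspan><tspan font-size="10">/Volumes/T/b.pdf</tspan></text>' \
  '</svg>')

result=$(printf '%s\n' "$sample" | awk '{
  # Cut the line at every </tspan>; the last piece is whatever follows
  # the final closing tag, so the loop skips it (i < n).
  n = split($0, parts, "</tspan>")
  for (i = 1; i < n; i++) {
    sub(/.*>/, "", parts[i])   # drop everything up to the last ">"
    print parts[i]
  }
}')

printf '%s\n' "$result"
```

Lines without a </tspan> give n = 1, so the loop never runs for them and no empty lines are printed.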

billymayday 06-30-2009 05:51 PM

Try http://www.ibm.com/developerworks/li...ry/l-sed1.html as a good sed primer


All times are GMT -5.