LinuxQuestions.org
Old 06-29-2009, 09:40 PM   #1
Ksearch
LQ Newbie
 
Registered: Jun 2009
Posts: 3

Rep: Reputation: 0
Get a list of delimited filenames from a text file (sed?)


Hi, I'm really new to Bash, so this could sound silly to most of you. I'm trying to get a list of some filenames from a text file. Tried to do this with sed and awk, but couldn't get it to work with my limited knowledge.

This is a sample file content:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948) -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="471.677px" height="126.604px" viewBox="0 0 471.677 126.604" enable-background="new 0 0 471.677 126.604"
xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="471.667" height="125.333"/>
<text transform="matrix(1 0 0 1 0.0098 8.3701)"><tspan x="0" y="0" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf</tspan><tspan x="0" y="12" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf</tspan><tspan x="0" y="24" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</tspan></text>
</svg>
What I would like to get from this sample is a new text file with this exact content:

/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

I thought of telling sed to print all the matching entries between 'font-size="10">' and '</tspan>', but the best I got was a file with the whole line containing my field delimiters.

If you could explain each step, that would be great.

There could be more or fewer filenames. These 3 are just an example.
 
Old 06-29-2009, 09:54 PM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
Using your example, the syntax would be simple---ie find all patterns beginning in "/Volumes" and ending in ".pdf"

The Regex would be: "/Volumes.*\.pdf"

So--verify what the criteria should be, and post some sample code. Also, what references (books, tutorials, etc.) are you using?
 
Old 06-29-2009, 10:04 PM   #3
billymayday
Guru
 
Registered: Mar 2006
Location: Sydney, Australia
Distribution: Fedora, CentOS, OpenSuse, Slack, Gentoo, Debian, Arch, PCBSD
Posts: 6,678

Rep: Reputation: 122
Or
Code:
sed -e 's/.*font-size="10">\(.*\)<\/tspan>/\1/' your_input_file
where \(.*\) effectively picks up the pattern between "10"> and </tspan>, and replaces the line with it (\1).
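A small sketch of what this substitution does when several entries share one line. The `sample` variable is a cut-down stand-in for the `<text>` line of the SVG, and the short paths in it are made up for illustration. Because `.*` is greedy, the leading `.*` runs up to the last `font-size="10">`, so only the last filename is captured, and whatever follows the matched `</tspan>` is left in place. Lines that do not match at all are printed unchanged.

```shell
#!/bin/sh
# Cut-down stand-in for the <text> line of the SVG (paths are made up).
sample='<text><tspan font-size="10">/tmp/a.pdf</tspan><tspan font-size="10">/tmp/b.pdf</tspan></text>'

# Same substitution as in the post, applied to the one-line sample.
result=$(printf '%s\n' "$sample" | sed -e 's/.*font-size="10">\(.*\)<\/tspan>/\1/')

# Only the last path survives, followed by the untouched tail of the line.
printf '%s\n' "$result"
```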

Last edited by billymayday; 06-29-2009 at 10:06 PM.
 
Old 06-29-2009, 11:52 PM   #4
Ksearch
LQ Newbie
 
Registered: Jun 2009
Posts: 3

Original Poster
Rep: Reputation: 0
Thanks for such a quick reply!

I've already tried both methods, from billymayday and pixellany. I think I'm getting them both wrong though :b


1) Here is my code for pixellany's solution:
#!/bin/bash
DEBUGGINGDIR=/Volumes/Secondary500/Temp
FILE=$DEBUGGINGDIR/*.svg
PRINTFILE=$DEBUGGINGDIR/10pt.txt

cat $FILE | awk -F '/Volumes.*\.pdf' '{print $2;}' > $PRINTFILE

And this is the output I get from it (the input .svg file content is the initially given example):

</tspan></text>

What am I doing wrong?
(I'm learning from a lot of web pages like http://linux.org.mt/article/terminal, http://www.cs.hmc.edu/tech_docs/qref/sed.html, http://ftp.gnu.org/old-gnu/Manuals/s...ter/sed_3.html, plus Google and Apple's man pages, since I'm using OS X 10.5. Could you recommend a good one? Thanks.)


2) And this is the code I used for billymayday's solution:

#!/bin/bash
DEBUGGINGDIR=/Volumes/Secondary500/Temp
FILE=$DEBUGGINGDIR/*.svg
PRINTFILE=$DEBUGGINGDIR/10pt.txt

sed -e 's/.*font-size="10">\(.*\)<\/tspan>/\1/g' $FILE > $PRINTFILE


And it gave me this output (I got only the first filename; I tried adding "g" afterwards, but it didn't work):

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948) -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="362.51px" height="97.437px" viewBox="0 0 362.51 97.437" enable-background="new 0 0 362.51 97.437" xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="362.5" height="96.167"/>
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf</text>
</svg>

What should I fix? I've been trying different approaches, but still can't make it :s
 
Old 06-30-2009, 12:07 AM   #5
billymayday
Guru
 
Registered: Mar 2006
Location: Sydney, Australia
Distribution: Fedora, CentOS, OpenSuse, Slack, Gentoo, Debian, Arch, PCBSD
Posts: 6,678

Rep: Reputation: 122
Yes, well I guess a bit more testing would have helped, huh?

Code:
 sed -n -e 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p' test1
looks better.
 
Old 06-30-2009, 05:22 AM   #6
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
When I suggested the structure of the Regex to be used, I did not mean that you would use it as the field separator in AWK......
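To illustrate why the awk attempt printed only `</tspan></text>`: with `-F`, the regex describes the separator, so the matched path itself is discarded and `$2` is whatever follows it. The sketch below uses a made-up one-line sample and a shortened pattern (`/tmp` instead of `/Volumes`); the dot is written as `[.]` because some awks warn about `\.` inside a `-F` string.

```shell
#!/bin/sh
# Made-up one-line sample with a single path in it.
line='<tspan>/tmp/a.pdf</tspan></text>'

# The -F regex is the field separator: the path is thrown away,
# $1 is the text before it and $2 is the text after it.
before=$(printf '%s\n' "$line" | awk -F '/tmp.*[.]pdf' '{ print $1 }')
after=$(printf '%s\n' "$line" | awk -F '/tmp.*[.]pdf' '{ print $2 }')

printf 'before: %s\nafter:  %s\n' "$before" "$after"
```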

Here is just one way to do this in SED:
Code:
sed -n 's/.*\(word\).*/\1/p' filename
Translation:
Suppress printing unless stated (the -n option).
For any line containing "word", replace the entire line with "word", then print it (the trailing p).

Will only pick up one instance of "word" per line.....
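A minimal sketch of the `-n` plus `p` behaviour, with a path-shaped pattern in place of "word". The three input lines and the `/tmp/...` paths are made up. Lines without a match produce no output, and a line with two matches yields only the last one, because the leading `.*` is greedy.

```shell
#!/bin/sh
# Three made-up input lines: one path, no path, two paths.
input='keep /tmp/a.pdf here
no match on this line
/tmp/b.pdf and /tmp/c.pdf'

# -n suppresses the default output; the trailing p prints only the lines
# on which the substitution succeeded.
out=$(printf '%s\n' "$input" | sed -n 's/.*\(\/tmp\/[a-z]*\.pdf\).*/\1/p')

printf '%s\n' "$out"
```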

How about "grep -o"?
 
Old 06-30-2009, 05:47 AM   #7
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 241
minimal regular expression.
Code:
awk 'BEGIN{RS="</tspan>";FS=">"}{ print $NF}' file
output
Code:
# more file
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 13.0.1, SVG Export Plug-In . SVG Version: 6.00 Build 14948) -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
width="471.677px" height="126.604px" viewBox="0 0 471.677 126.604" enable-background="new 0 0 471.677 126.604"
xml:space="preserve">
<rect x="0.01" y="1.27" fill="none" width="471.667" height="125.333"/>
<text transform="matrix(1 0 0 1 0.0098 8.3701)"><tspan x="0" y="0" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/T
emp/Untitled-2_Layer 1 copy 2.pdf</tspan><tspan x="0" y="12" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Un
titled-2_Layer 1 copy.pdf</tspan><tspan x="0" y="24" font-family="'MyriadPro-Regular'" font-size="10">/Volumes/Secondary500/Temp/Untitled-2
_Layer 1.pdf</tspan></text>
</svg>

# ./testnew.sh
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy 2.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1 copy.pdf
/Volumes/Secondary500/Temp/Untitled-2_Layer 1.pdf

Last edited by ghostdog74; 06-30-2009 at 05:49 AM.
 
Old 06-30-2009, 05:48 AM   #8
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 241
Quote:
Originally Posted by billymayday
Yes, well I guess a bit more testing would have helped, huh?

Code:
 sed -n -e 's/.*font-size="10">\(.*\)<\/tspan>.*/\1/p' test1
looks better.
unless all of them are on single line (which i doubt), the above only can get 1 result.
 
Old 06-30-2009, 05:57 AM   #9
billymayday
Guru
 
Registered: Mar 2006
Location: Sydney, Australia
Distribution: Fedora, CentOS, OpenSuse, Slack, Gentoo, Debian, Arch, PCBSD
Posts: 6,678

Rep: Reputation: 122
Quote:
Originally Posted by ghostdog74
unless all of them are on single line (which i doubt), the above only can get 1 result.
Don't you mean unless they're all on different lines? If they're on the same line, you'll only get one result.

I didn't spend that long on the data to be honest.
 
Old 06-30-2009, 06:26 AM   #10
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
If I can get the filenames to not have line breaks in them, then this works:

grep -o '/Volumes.*pdf' file
 
Old 06-30-2009, 06:34 AM   #11
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 241
Quote:
Originally Posted by billymayday
Don't you mean unless they're all on different lines? If they're on the same line, you'll only get one result.

I didn't spend that long on the data to be honest.
Yes, pardon my English. If they are all on the same line, then sed, which has no non-greedy match, will give only 1 result.
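One possible way around the missing non-greedy match in sed, offered as a sketch rather than as what anyone in the thread ran: first break the line at every `</tspan>`, then strip everything up to `font-size="10">` on each resulting line. The `sample` line and its paths are made up. The backslash followed by a real line break is the portable way to put a newline in a sed replacement.

```shell
#!/bin/sh
# Made-up one-line sample with two entries.
sample='<text><tspan font-size="10">/tmp/a 1.pdf</tspan><tspan font-size="10">/tmp/b.pdf</tspan></text>'

# Step 1: turn every </tspan> into a line break.
# Step 2: on each line, keep only what follows font-size="10">;
#         lines without that marker are not printed.
paths=$(printf '%s\n' "$sample" |
  sed -e 's/<\/tspan>/\
/g' |
  sed -n -e 's/.*font-size="10">\(.*\)/\1/p')

printf '%s\n' "$paths"
```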
 
Old 06-30-2009, 06:45 AM   #12
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,239

Rep: Reputation: 1020
Quote:
Originally Posted by pixellany
If I can get the filenames to not have line breaks in them, then this works:

grep -o '/Volumes.*pdf' file
Not if there are 2 or more on the one line; note ghostdog74's comment on greediosity. Try
Code:
grep -Eo "/Volumes[^.]*.pdf" file
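A side-by-side sketch of the greedy pattern and a negated character class, on a made-up one-line sample with `/tmp` paths. The `[^.]*` in the command above stops at the first dot, so a path with a dot earlier in it would be missed, and the `.` before `pdf` is unescaped and matches any character. The variant below is an assumption about what suits this data: it uses `[^<]*`, which relies on the paths never containing `<`, and escapes the dot.

```shell
#!/bin/sh
# Made-up sample: two entries on the same line.
sample='<tspan>/tmp/a.pdf</tspan><tspan>/tmp/b.pdf</tspan>'

# Greedy: one match running from the first /tmp to the last pdf.
greedy=$(printf '%s\n' "$sample" | grep -o '/tmp.*pdf')

# Negated class: each match stops before the next "<".
each=$(printf '%s\n' "$sample" | grep -Eo '/tmp[^<]*\.pdf')

printf 'greedy:\n%s\neach:\n%s\n' "$greedy" "$each"
```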
 
Old 06-30-2009, 06:51 AM   #13
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
Touche ( I mean: TOO-SHAY....How do I type accented letters here?)

greediosity???? Hmmmm

Now---make it work if there are line breaks in the desired matched patterns........
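One hedged way to handle line breaks inside a path is to delete all newlines before matching. This is a sketch on made-up data, and it assumes the break was inserted in the middle of the name rather than standing in for a space, and that the file is small enough to treat as a single line.

```shell
#!/bin/sh
# Made-up sample where a hard line break falls inside the second path.
wrapped='<tspan>/tmp/a.pdf</tspan><tspan>/tmp/lo
ng name.pdf</tspan>'

# Delete every newline first, then extract each path from the joined line.
joined=$(printf '%s\n' "$wrapped" | tr -d '\n' | grep -Eo '/tmp[^<]*\.pdf')

printf '%s\n' "$joined"
```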
 
Old 06-30-2009, 11:57 AM   #14
Ksearch
LQ Newbie
 
Registered: Jun 2009
Posts: 3

Original Poster
Rep: Reputation: 0

Thanks a lot, this line from Pixellany and syg00 output exactly what I was looking for. So I'm gonna learn grep better!

grep -Eo "/Volumes[^.]*.pdf" file

Ghostdog74, the awk line worked as well, but printed a lot of empty lines before and between the filenames. Do you know why? How can that be avoided? (For educational purposes, jeje. I'm going to need awk and sed very soon for a couple of scripts.)
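On the empty lines, a likely explanation, stated as an assumption since it cannot be checked against the OS X 10.5 awk here: POSIX only defines RS when it is a single character. A multi-character RS such as `</tspan>` is an extension in gawk and mawk, and some other awks use just its first character. In that case every `<` ends a record, and `$NF` with `FS=">"` is empty for most of those records. Even where the multi-character RS works, the leftover record after the last `</tspan>` still prints a blank line. The sketch below avoids RS altogether by calling `split()` on each line; the sample line and its paths are made up, and it assumes each `<tspan>` entry sits on one line.

```shell
#!/bin/sh
# Made-up one-line sample with two entries.
sample='<text><tspan font-size="10">/tmp/a 1.pdf</tspan><tspan font-size="10">/tmp/b.pdf</tspan></text>'

names=$(printf '%s\n' "$sample" | awk '{
  # Split the line at every </tspan>; the piece after the last one is not
  # a filename, so the loop stops at n - 1.
  n = split($0, piece, "</tspan>")
  for (i = 1; i < n; i++) {
    sub(/.*>/, "", piece[i])   # drop everything up to the last ">"
    print piece[i]
  }
}')

printf '%s\n' "$names"
```

Lines without any `</tspan>` give `n = 1`, so the loop body never runs and nothing is printed for them.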
 
Old 06-30-2009, 04:51 PM   #15
billymayday
Guru
 
Registered: Mar 2006
Location: Sydney, Australia
Distribution: Fedora, CentOS, OpenSuse, Slack, Gentoo, Debian, Arch, PCBSD
Posts: 6,678

Rep: Reputation: 122
Try http://www.ibm.com/developerworks/li...ry/l-sed1.html as a good sed primer
 
All times are GMT -5. The time now is 04:31 PM.