[SOLVED] How would I use awk or sed to match this?

ted_chou12 · 04-08-2011, 01:58 AM

Hi, I am quite new and not sure how awk or sed can be used to regex-match strings. How would I go about matching a string like this:

Code:

Content-Disposition: attachment; filename=
	"=?big5?B?W8Opw6m2faTfpHDC7V9ieV9qYWNreWtvMjAwMl13aGl0ZSBhbGJ1bSAxNC0yNi50?=
 =?big5?Q?orrent?="
Content-Disposition: attachment; filename="=?big5?B?W8Opw6m2faTfpHDC7V9ieV9qYWNreWtvMjAwMl13aGl0ZSBhbGJ1bSAxNC0yNi50?=
 =?big5?Q?orrent?="

That is, how do I match anything within the quotation marks that follow the "Content-Disposition: attachment; filename=" string?
Thanks,
Ted

grail · 04-08-2011, 02:28 AM

Code:

sed -n '/^Content/s/[^"]+"|"$//p' file

Something like that ... might need '-r' as well.

ted_chou12 · 04-08-2011, 02:57 AM

Thanks for your help, but I can't get that to work either. I've attempted to modify it, does this look more understandable:

Code:

echo $(cat "/var/mail/root/msg.9CT" | sed -r '/^Content-Disposition: attachment; filename=(*.?)/\1/p')

sh-3.1# /mnt/sda1/test.sh
sed: -e expression #1, char 50: Invalid preceding regular expression

Content-Disposition: attachment; filename= "=?big5?B?W8Opw6m2faTWl13i50?==?big5?Q?orrent?="
In the line above, 'Content-Disposition: attachment; filename=' is the constant portion and the quoted string is the variable portion. The whitespace between 'filename=' and the opening quotation mark '"' can be a single space, a newline, or several tabs.
Thanks
Ted

grail · 04-08-2011, 03:35 AM

Sorry, my bad. I just thought the line endings had been messed up in the paste.

By the way, this is the correct sed if it is all on one line:

Code:

sed -r -n '/^Content/s/^[^"]*"|"$//gp' file
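For reference, a minimal sketch of how that alternation behaves on a single-line header. The sample line and the strip_quotes function name are made up for illustration, and -E is used here as the more portable spelling of GNU sed's -r.

```shell
# Hypothetical single-line header, shortened for readability.
line='Content-Disposition: attachment; filename="example.torrent"'

# ^[^"]*"  matches everything up to and including the first quote
# "$       matches a quote at the very end of the line
# The g flag lets the substitution remove both matches in one pass.
strip_quotes() {
  sed -E -n '/^Content/s/^[^"]*"|"$//gp'
}

printf '%s\n' "$line" | strip_quotes
```

This only helps when the whole header sits on one line; a header wrapped across lines still needs the lines joined first.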

So now that I know it is over multiple lines ... how do you want the data returned? ie. do you still want it over multiple lines or joined into a single entry?

jschiwal · 04-08-2011, 03:36 AM

Quote:

sed -r '/^Content-Disposition: attachment; filename=(*.?)/\1/p')

The first part: '/^Content-Disposition: attachment; filename=(*.?)/
matches a line. Your `s' command is missing a left hand side.
The "*.?" pattern doesn't make sense.
To match characters inside quotes, you could use:
"\([^"]*\)" or ".*"

However, your sample had the contents inside the quotes spread across 2 or 3 lines. Was that a mistake in copying to this post, or does it represent a real sample? Sed is a line editor: the LHS of a substitution is matched against the pattern space, which normally holds a single line of input. If the input spans 2 or 3 lines, you need to build up more lines using the command "N" or "H".
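Before the multi-line case, a small sketch of the single-line case: a complete `s' command with a left hand side and a capture group, using each of the two quote patterns mentioned above. The sample line and the function names are hypothetical; they only show how the patterns differ when a line holds more than one quoted string.

```shell
# Hypothetical line with two quoted strings.
line='filename="a.torrent"; note="second"'

# Negated class: the group stops at the first closing quote.
first_quoted() {
  sed -n 's/^[^"]*"\([^"]*\)".*/\1/p'
}

# Greedy .* : the group runs from the first quote to the last quote on the line.
greedy_quoted() {
  sed -n 's/^[^"]*"\(.*\)".*/\1/p'
}

printf '%s\n' "$line" | first_quoted    # prints: a.torrent
printf '%s\n' "$line" | greedy_quoted   # prints: a.torrent"; note="second
```

With only one quoted string per line, as in the sample header, both forms give the same result.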

Code:

sed -n '/Content-Disposition/{ /".*"/!N}
        /Content-Disposition/{ /".*"/!N}
        /Content-Disposition/{ /".*"/s/\n//g;p}' test
Content-Disposition: attachment; filename=      "=?big5?B?W8Opw6m2faTfpHDC7V9ieV9qYWNreWtvMjAwMl13aGl0ZSBhbGJ1bSAxNC0yNi50?= =?big5?Q?orrent?="
Content-Disposition: attachment; filename="=?big5?B?W8Opw6m2faTfpHDC7V9ieV9qYWNreWtvMjAwMl13aGl0ZSBhbGJ1bSAxNC0yNi50?= =?big5?Q?orrent?="

Here I am building up the input pattern by up to 3 lines if both quotes are not present in the line. Then I remove the line feeds, joining the lines.

If you just want the contents between the quotes (quotes included), you could use:
s/\n//g;s/.*\(".*"\).*/\1/p
in place of the final `s/\n//g;p'.

This prints the contents without the quotes:

Code:

sed -n '/Content-Disposition/{ /".*"/!N}
        /Content-Disposition/{ /".*"/!N}
        /Content-Disposition/{ /".*"/s/\n//g;s/.*"\(.*\)"/\1/p}' test
=?big5?B?W8Opw6m2faTfpHDC7V9ieV9qYWNreWtvMjAwMl13aGl0ZSBhbGJ1bSAxNC0yNi50?= =?big5?Q?orrent?=
=?big5?B?W8Opw6m2faTfpHDC7V9ieV9qYWNreWtvMjAwMl13aGl0ZSBhbGJ1bSAxNC0yNi50?= =?big5?Q?orrent?=
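The three repeated lines above cap the look-ahead at two extra lines. If the header could wrap further, a loop with a label is one way to keep appending lines until the closing quote shows up. This is only a sketch: the sample data is shortened and made up, join_quoted is a hypothetical name, and it assumes each header contains exactly one quoted string.

```shell
# Hypothetical sample: one header wrapped over three lines, one on a single line.
sample=$(printf 'Content-Disposition: attachment; filename=\n\t"=?big5?B?AAAA?=\n =?big5?Q?orrent?="\nContent-Disposition: attachment; filename="plain.txt"\n')

join_quoted() {
  sed -n '/Content-Disposition/{
    :a
    /"[^"]*"/!{
      N
      ba
    }
    s/\n//g
    s/.*"\(.*\)"/\1/p
  }'
}

printf '%s\n' "$sample" | join_quoted
```

The loop at label a appends the next line (N) and branches back (ba) for as long as the pattern space does not yet hold an opening and a closing quote. After that the newlines are removed and only the text between the quotes is printed.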

Both of your samples contain the same content. Is that what you wanted?

----

Knowing more information about the input pattern can help. For example, if you have records of lines separated by a blank line, things could be a lot easier.
eg:

Code:

/sbin/lspci -v | sed -n '/Network/,/^$/p'
14:00.0 Network controller: Atheros Communications Inc. AR928X Wireless Network Adapter (PCI-Express) (rev 01)
        Subsystem: Foxconn International, Inc. Device e009
        Flags: bus master, fast devsel, latency 0, IRQ 19
        Memory at f2100000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: <access denied>
        Kernel driver in use: ath9k

This matches a range of lines, from a line matching the first pattern through the next blank line. Commands grouped inside braces { } can then operate on just that range.
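A small sketch of operating on such a range inside braces. The input below is invented (it is not real lspci output) and network_block is a hypothetical name; the idea is simply to select the record, drop its blank terminator, and strip the indentation.

```shell
# Hypothetical records separated by blank lines.
records=$(printf 'eth0 Ethernet controller\n  driver: e1000\n\nwlan0 Network controller\n  driver: ath9k\n\neth1 Ethernet controller\n  driver: r8169\n')

network_block() {
  sed -n '/Network/,/^$/{
    /^$/d
    s/^ *//
    p
  }'
}

printf '%s\n' "$records" | network_block
# prints:
# wlan0 Network controller
# driver: ath9k
```

Everything between the braces runs only for lines inside the /Network/,/^$/ range, so the other records are never touched.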

ted_chou12 · 04-08-2011, 03:54 AM

Thank you, jschiwal! This was a very detailed explanation! Thanks to grail too.
Ted

grail · 04-08-2011, 04:00 AM

How about:

Code:

awk 'BEGIN{RS="[ \t\n]*\"[ \t\n]*"}/^Content/{getline;print}' file

Edit: You can also put the output for each match on a single line, like so:

Code:

awk 'BEGIN{RS="[ \t\n]*\"[ \t\n]*"}/^Content/{getline;gsub(/[[:space:]]/,"");print}' file
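To spell out what the one-liners rely on: setting RS to a regular expression makes awk split the input at every quote, together with any whitespace around it, so the header text and the quoted value end up in alternating records. Below is a sketch of the same idea using a flag instead of getline. It assumes an awk that accepts a regular expression in RS (gawk and mawk do; strict POSIX awk only uses the first character). The sample data is shortened and made up, and quoted_names is a hypothetical name.

```shell
# Hypothetical sample: one wrapped header and one single-line header.
sample=$(printf 'Content-Disposition: attachment; filename=\n\t"AAA?=\n =?big5?Q?orrent?="\nContent-Disposition: attachment; filename="BBB?=\n =?big5?Q?orrent?="\n')

# Records after splitting on RS:
#   1: Content-Disposition: attachment; filename=
#   2: AAA?=<newline> =?big5?Q?orrent?=
#   3: Content-Disposition: attachment; filename=
#   4: BBB?=<newline> =?big5?Q?orrent?=
quoted_names() {
  awk 'BEGIN { RS = "[ \t\n]*\"[ \t\n]*" }
       want { gsub(/[ \t\n]/, ""); print; want = 0 }   # record after a header
       /^Content-Disposition/ { want = 1 }             # remember we saw a header'
}

printf '%s\n' "$sample" | quoted_names
```

The gsub removes the line break and leading space left over from the wrapped header, so each value comes out on one line.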

ted_chou12 · 04-08-2011, 04:27 AM

Thanks grail! This works awesome!