How-to cut specific text from cell and paste into new column

ivn · 12-13-2011, 09:48 PM

I have a CSV file with data in a column that contains the product name, size, color (but sometimes it may or may not contain the size and or color). I need to get the size and color (if available) into their own column.

I'm using a shell script to manipulate the file. I tried using sed and a file that consists of lines like: "s/\bBLACK\b/,Black,/g" in order to create a new column, but that screws up the file because not all items contain color and/or size information.

Right now I added two empty columns to the original file. If I had a list of colors and sizes (which I do have), whats the best way to cut or move the color matches into one of the new columns, and the size matches into the other?

What is the best approach, any recommendations?

Sample cell:
HELMET GP-TECH BLACK XL

Sample Result:
HELMET GP-TECH,BLACK,XL

Dark_Helmet · 12-13-2011, 09:56 PM

You may need to be more specific about your data. For example, this sed script works with the example data you provided:

Code:

user@localhost:~$ echo "HELMET GP-TECH BLACK XL" | sed 's@\(.*\) \(.*\) \(.*\)@\1,\2,\3@'
HELMET GP-TECH,BLACK,XL
user@localhost:~$

The limitations are:
1. The line must contain at least two spaces
2. The last two spaces are used to separate the color and size information

This obviously will not work if your color or size options contain spaces themselves.

EDIT:
In other words, it would be useful to see some real, full-fledged sample data (any confidential info "scrubbed" if necessary). Otherwise, we can't test a solution on our end to make sure it matches what you need.

EDIT2:
Just as another example, this might work (which shows additional data that could be part of the CSV):

Code:

user@localhost:~$ echo "HELMET GP-TECH BLACK XL,dummy data 1,dummy data 2" | sed 's@\([^,]*\) \([^,]*\) \([^,]*\)@\1,\2,\3@'
HELMET GP-TECH,BLACK,XL,dummy data 1,dummy data 2
user@localhost:~$

ivn · 12-14-2011, 08:35 PM

Here is some sample data. The problem is that the column I'm working with, has no set pattern. Colors and Sizes are not necessarily in the same order, and colors can be combos (blk/wht) and not necessarily spelled the same, some items have sizes only and no colors, and so on and so forth. I'd also like to rename the colors and sizes to something standard. So like I mentioned before, I have a file with all the possible color and size variables that I'm looking for.

Thanks for your help!

Code:

dummydata1,dummydata2,HELMET GP-TECH BLACK S,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH BLACK M,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH BLACK L,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT SM,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT MD,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT LG,dummydata3,dummydata4
dummydata1,dummydata2,T2 LEVER BUCKLE-PAIR,dummydata3,dummydata4
dummydata1,dummydata2,TECH 4 BUCKLE SET  BLACK,dummydata3,dummydata4
dummydata1,dummydata2,TECH 4 BUCKLE SET BLK/WHT,dummydata3,dummydata4
dummydata1,dummydata2,TECH6S BUCKLE SET-BLACK,dummydata3,dummydata4
dummydata1,dummydata2,TECH6S BUCKLE SET-WHITE,dummydata3,dummydata4
dummydata1,dummydata2,T-6 BUCKL/STRAP SET BK/BK,dummydata3,dummydata4
dummydata1,dummydata2,T-6 BUCKL/STRAP SET BK/WT,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK L,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK M,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK S,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK XL,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK XXL,dummydata3,dummydata4

Here is some data from the other columns:

Code:

0101-1234,HELMET GP-TECH BLACK M,AGV SPA - ITALY,AGV,59995,55,0
0101-1235,HELMET GP-TECH BLACK L,AGV SPA - ITALY,AGV,59995,55,0
0101-1236,HELMET GP-TECH BLACK XL,AGV SPA - ITALY,AGV,59995,55,7
0101-1237,HELMET GP-TECH BLACK XXL,AGV SPA - ITALY,AGV,59995,54,5
0101-1238,HELMET GP-TECH BLACK 3XL,AGV SPA - ITALY,AGV,59995,54,0

Dark_Helmet · 12-14-2011, 09:12 PM

If I understand what you're describing, then the typical automated text tools (grep, cut, sed, etc.) won't handle your situation. You'll probably need something like (g)awk.

So, just to make sure, you're saying:
1. "Colors and Sizes are not necessarily in the same order" --> meaning line 5 might need column 2 placed after column 4, but line 8 might need no column adjustment
2. "colors can be combos (blk/wht) and not necessarily spelled the same" --> meaning there is no consistency among column "values" (which would prevent an easy identification that a specific cell was a size/color/something else)
3. "some items have sizes only and no colors" --> meaning some lines might require adding empty "placeholders" to have each line contain a consistent number of cells

So, if that's correct, then I have good news and bad news.

The bad news is:
(g)awk is probably the best approach if you want to handle all the manipulations in one shot. The reason that is bad news is because (1) I'm not that experienced with it and wouldn't be much help and (2) that there's a bit of a learning curve especially if you come to it from a non-programming background. For reference, here's The GNU Awk User's Guide. It's a "light" read... Though, there are some (g)awk gurus lurking the forums.

The good news is:
I can help you tackle it piece-meal if you like. Each component above should be possible to fix one-at-a-time. Doing so might take a while going back-and-forth. So we might have to take the discussion off the forum an over to email.

grail · 12-14-2011, 10:49 PM

Using the data from post #3, would you please also provide (I realise manually) the output of that data?

Also, are you able to confirm if the section of data in the csv we are looking at is always in the same column?
I ask as the 2 examples in post #3 have it in differing columns, ie first data in column 3 but second in column 2.

I agree with DarkH that awk is probably one of the easier solutions. If we use first data in post #3 we would be looking
at something like:

Code:

awk -F, '{n=split($3, arr, " ")}' file.csv

On it's own this does nothing but it will place all the items we are interested in into an array called 'arr' which then can
be easily worked on to see what is in each item of the array.

I will await feedback to provide more to the solution

ivn · 12-17-2011, 08:53 PM

Quote:

Originally Posted by grail

Using the data from post #3, would you please also provide (I realise manually) the output of that data?

I need to add two columns, one for color, and another for size. In this example, if I find a color or size in the 3rd column, move it to the respective column. If nothing is found/matched, the two columns would remain with no data in them.

Code:

dummydata1,dummydata2,HELMET GP-TECH,BLACK,S,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH,BLACK,M,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH,BLACK,L,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT,,SM,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT,,MD,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT,,LG,dummydata3,dummydata4
dummydata1,dummydata2,T2 LEVER BUCKLE-PAIR,,,dummydata3,dummydata4
dummydata1,dummydata2,TECH 4 BUCKLE SET,BLACK,,dummydata3,dummydata4
dummydata1,dummydata2,TECH 4 BUCKLE SET,BLK/WHT,,dummydata3,dummydata4
dummydata1,dummydata2,TECH6S BUCKLE SET-,BLACK,,dummydata3,dummydata4
dummydata1,dummydata2,TECH6S BUCKLE SET-,WHITE,,dummydata3,dummydata4
dummydata1,dummydata2,T-6 BUCKL/STRAP SET,BK/BK,,dummydata3,dummydata4
dummydata1,dummydata2,T-6 BUCKL/STRAP SET,BK/WT,,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.,BLK/BLK,L,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.,BLK/BLK,M,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.,BLK/BLK,S,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.,BLK/BLK,XL,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.,BLK/BLK,XXL,dummydata3,dummydata4

Quote:

Originally Posted by grail

Also, are you able to confirm if the section of data in the csv we are looking at is always in the same column?
I ask as the 2 examples in post #3 have it in differing columns, ie first data in column 3 but second in column 2.

Yes, data will always be in the same column; sorry - wasn't consistent.

Quote:

Originally Posted by grail

I agree with DarkH that awk is probably one of the easier solutions. If we use first data in post #3 we would be looking
at something like:

Code:

awk -F, '{n=split($3, arr, " ")}' file.csv

On it's own this does nothing but it will place all the items we are interested in into an array called 'arr' which then can
be easily worked on to see what is in each item of the array.

I will await feedback to provide more to the solution

What would be the next step(s)?