How-to cut specific text from cell and paste into new column
I have a CSV file with data in a column that contains the product name, size, color (but sometimes it may or may not contain the size and or color). I need to get the size and color (if available) into their own column.
I'm using a shell script to manipulate the file. I tried using sed and a file that consists of lines like: "s/\bBLACK\b/,Black,/g" in order to create a new column, but that screws up the file because not all items contain color and/or size information. Right now I added two empty columns to the original file. If I had a list of colors and sizes (which I do have), whats the best way to cut or move the color matches into one of the new columns, and the size matches into the other? What is the best approach, any recommendations? Sample cell: HELMET GP-TECH BLACK XL Sample Result: HELMET GP-TECH,BLACK,XL |
You may need to be more specific about your data. For example, this sed script works with the example data you provided:
Code:
user@localhost:~$ echo "HELMET GP-TECH BLACK XL" | sed 's@\(.*\) \(.*\) \(.*\)@\1,\2,\3@' 1. The line must contain at least two spaces 2. The last two spaces are used to separate the color and size information This obviously will not work if your color or size options contain spaces themselves. EDIT: In other words, it would be useful to see some real, full-fledged sample data (any confidential info "scrubbed" if necessary). Otherwise, we can't test a solution on our end to make sure it matches what you need. EDIT2: Just as another example, this might work (which shows additional data that could be part of the CSV): Code:
user@localhost:~$ echo "HELMET GP-TECH BLACK XL,dummy data 1,dummy data 2" | sed 's@\([^,]*\) \([^,]*\) \([^,]*\)@\1,\2,\3@' |
Here is some sample data. The problem is that the column I'm working with, has no set pattern. Colors and Sizes are not necessarily in the same order, and colors can be combos (blk/wht) and not necessarily spelled the same, some items have sizes only and no colors, and so on and so forth. I'd also like to rename the colors and sizes to something standard. So like I mentioned before, I have a file with all the possible color and size variables that I'm looking for.
Thanks for your help! Code:
dummydata1,dummydata2,HELMET GP-TECH BLACK S,dummydata3,dummydata4 Code:
0101-1234,HELMET GP-TECH BLACK M,AGV SPA - ITALY,AGV,59995,55,0 |
If I understand what you're describing, then the typical automated text tools (grep, cut, sed, etc.) won't handle your situation. You'll probably need something like (g)awk.
So, just to make sure, you're saying: 1. "Colors and Sizes are not necessarily in the same order" --> meaning line 5 might need column 2 placed after column 4, but line 8 might need no column adjustment 2. "colors can be combos (blk/wht) and not necessarily spelled the same" --> meaning there is no consistency among column "values" (which would prevent an easy identification that a specific cell was a size/color/something else) 3. "some items have sizes only and no colors" --> meaning some lines might require adding empty "placeholders" to have each line contain a consistent number of cells So, if that's correct, then I have good news and bad news. :) The bad news is: (g)awk is probably the best approach if you want to handle all the manipulations in one shot. The reason that is bad news is because (1) I'm not that experienced with it and wouldn't be much help and (2) that there's a bit of a learning curve especially if you come to it from a non-programming background. For reference, here's The GNU Awk User's Guide. It's a "light" read... Though, there are some (g)awk gurus lurking the forums. The good news is: I can help you tackle it piece-meal if you like. Each component above should be possible to fix one-at-a-time. Doing so might take a while going back-and-forth. So we might have to take the discussion off the forum an over to email. |
Using the data from post #3, would you please also provide (I realise manually) the output of that data?
Also, are you able to confirm if the section of data in the csv we are looking at is always in the same column? I ask as the 2 examples in post #3 have it in differing columns, ie first data in column 3 but second in column 2. I agree with DarkH that awk is probably one of the easier solutions. If we use first data in post #3 we would be looking at something like: Code:
awk -F, '{n=split($3, arr, " ")}' file.csv be easily worked on to see what is in each item of the array. I will await feedback to provide more to the solution :) |
Quote:
Code:
dummydata1,dummydata2,HELMET GP-TECH,BLACK,S,dummydata3,dummydata4 Quote:
Quote:
|
All times are GMT -5. The time now is 04:25 AM. |