How-to cut specific text from cell and paste into new column
Linux - NewbieThis Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
How-to cut specific text from cell and paste into new column
I have a CSV file with data in a column that contains the product name, size, color (but sometimes it may or may not contain the size and or color). I need to get the size and color (if available) into their own column.
I'm using a shell script to manipulate the file. I tried using sed and a file that consists of lines like: "s/\bBLACK\b/,Black,/g" in order to create a new column, but that screws up the file because not all items contain color and/or size information.
Right now I added two empty columns to the original file. If I had a list of colors and sizes (which I do have), whats the best way to cut or move the color matches into one of the new columns, and the size matches into the other?
You may need to be more specific about your data. For example, this sed script works with the example data you provided:
Code:
user@localhost:~$ echo "HELMET GP-TECH BLACK XL" | sed 's@\(.*\) \(.*\) \(.*\)@\1,\2,\3@'
HELMET GP-TECH,BLACK,XL
user@localhost:~$
The limitations are:
1. The line must contain at least two spaces
2. The last two spaces are used to separate the color and size information
This obviously will not work if your color or size options contain spaces themselves.
EDIT:
In other words, it would be useful to see some real, full-fledged sample data (any confidential info "scrubbed" if necessary). Otherwise, we can't test a solution on our end to make sure it matches what you need.
EDIT2:
Just as another example, this might work (which shows additional data that could be part of the CSV):
Code:
user@localhost:~$ echo "HELMET GP-TECH BLACK XL,dummy data 1,dummy data 2" | sed 's@\([^,]*\) \([^,]*\) \([^,]*\)@\1,\2,\3@'
HELMET GP-TECH,BLACK,XL,dummy data 1,dummy data 2
user@localhost:~$
Last edited by Dark_Helmet; 12-13-2011 at 10:04 PM.
Here is some sample data. The problem is that the column I'm working with, has no set pattern. Colors and Sizes are not necessarily in the same order, and colors can be combos (blk/wht) and not necessarily spelled the same, some items have sizes only and no colors, and so on and so forth. I'd also like to rename the colors and sizes to something standard. So like I mentioned before, I have a file with all the possible color and size variables that I'm looking for.
Thanks for your help!
Code:
dummydata1,dummydata2,HELMET GP-TECH BLACK S,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH BLACK M,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH BLACK L,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT SM,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT MD,dummydata3,dummydata4
dummydata1,dummydata2,HELMET GP-TECH 5-CONT LG,dummydata3,dummydata4
dummydata1,dummydata2,T2 LEVER BUCKLE-PAIR,dummydata3,dummydata4
dummydata1,dummydata2,TECH 4 BUCKLE SET BLACK,dummydata3,dummydata4
dummydata1,dummydata2,TECH 4 BUCKLE SET BLK/WHT,dummydata3,dummydata4
dummydata1,dummydata2,TECH6S BUCKLE SET-BLACK,dummydata3,dummydata4
dummydata1,dummydata2,TECH6S BUCKLE SET-WHITE,dummydata3,dummydata4
dummydata1,dummydata2,T-6 BUCKL/STRAP SET BK/BK,dummydata3,dummydata4
dummydata1,dummydata2,T-6 BUCKL/STRAP SET BK/WT,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK L,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK M,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK S,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK XL,dummydata3,dummydata4
dummydata1,dummydata2,ST-1 GLV.BLK/BLK XXL,dummydata3,dummydata4
Here is some data from the other columns:
Code:
0101-1234,HELMET GP-TECH BLACK M,AGV SPA - ITALY,AGV,59995,55,0
0101-1235,HELMET GP-TECH BLACK L,AGV SPA - ITALY,AGV,59995,55,0
0101-1236,HELMET GP-TECH BLACK XL,AGV SPA - ITALY,AGV,59995,55,7
0101-1237,HELMET GP-TECH BLACK XXL,AGV SPA - ITALY,AGV,59995,54,5
0101-1238,HELMET GP-TECH BLACK 3XL,AGV SPA - ITALY,AGV,59995,54,0
If I understand what you're describing, then the typical automated text tools (grep, cut, sed, etc.) won't handle your situation. You'll probably need something like (g)awk.
So, just to make sure, you're saying:
1. "Colors and Sizes are not necessarily in the same order" --> meaning line 5 might need column 2 placed after column 4, but line 8 might need no column adjustment
2. "colors can be combos (blk/wht) and not necessarily spelled the same" --> meaning there is no consistency among column "values" (which would prevent an easy identification that a specific cell was a size/color/something else)
3. "some items have sizes only and no colors" --> meaning some lines might require adding empty "placeholders" to have each line contain a consistent number of cells
So, if that's correct, then I have good news and bad news.
The bad news is:
(g)awk is probably the best approach if you want to handle all the manipulations in one shot. The reason that is bad news is because (1) I'm not that experienced with it and wouldn't be much help and (2) that there's a bit of a learning curve especially if you come to it from a non-programming background. For reference, here's The GNU Awk User's Guide. It's a "light" read... Though, there are some (g)awk gurus lurking the forums.
The good news is:
I can help you tackle it piece-meal if you like. Each component above should be possible to fix one-at-a-time. Doing so might take a while going back-and-forth. So we might have to take the discussion off the forum an over to email.
Using the data from post #3, would you please also provide (I realise manually) the output of that data?
Also, are you able to confirm if the section of data in the csv we are looking at is always in the same column?
I ask as the 2 examples in post #3 have it in differing columns, ie first data in column 3 but second in column 2.
I agree with DarkH that awk is probably one of the easier solutions. If we use first data in post #3 we would be looking
at something like:
Code:
awk -F, '{n=split($3, arr, " ")}' file.csv
On it's own this does nothing but it will place all the items we are interested in into an array called 'arr' which then can
be easily worked on to see what is in each item of the array.
I will await feedback to provide more to the solution
Using the data from post #3, would you please also provide (I realise manually) the output of that data?
I need to add two columns, one for color, and another for size. In this example, if I find a color or size in the 3rd column, move it to the respective column. If nothing is found/matched, the two columns would remain with no data in them.
Also, are you able to confirm if the section of data in the csv we are looking at is always in the same column?
I ask as the 2 examples in post #3 have it in differing columns, ie first data in column 3 but second in column 2.
Yes, data will always be in the same column; sorry - wasn't consistent.
Quote:
Originally Posted by grail
I agree with DarkH that awk is probably one of the easier solutions. If we use first data in post #3 we would be looking
at something like:
Code:
awk -F, '{n=split($3, arr, " ")}' file.csv
On it's own this does nothing but it will place all the items we are interested in into an array called 'arr' which then can
be easily worked on to see what is in each item of the array.
I will await feedback to provide more to the solution
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.