LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   BASH: Write only unique strings to text file (cat or while read question) (https://www.linuxquestions.org/questions/programming-9/bash-write-only-unique-strings-to-text-file-cat-or-while-read-question-822031/)

SilversleevesX 07-25-2010 06:48 AM

BASH: Write only unique strings to text file (cat or while read question)
 
Here's what I have:
A text file, 6000 lines long, of file names and identifier strings of varying lengths, delimited by commas: one file path paired with one "identifier" string per line.

Here's what I want to do:
Write a script that, every time it comes across an identifier string different from the one, two, or however many that came before it, passes that first unique identifier to another text file. Along with this ID should go the path string with which it was first encountered. This way, when the text file is viewed, one can tell at a glance, just to give an example, that gt7900-56262.jpg and gtxmas-a1001.jpg both come from Getty Images.

Is this possible in BASH, or should I consider some more robust scripting language/environment? Or maybe a GUI app like OpenOffice Calc?

BZT

GrapefruiTgirl 07-25-2010 06:58 AM

So, you've got a file with 2 columns of data, like:

Code:

key1 key2
key1 key2
key1 keys

and you want all lines with unique key1 columns to be put into a new file? If not, please show us a snippet of the actual file and explain further. But if so, you can easily do this using `sort -k1 -V filename | uniq -u > newfile`

Check the manpages for both `sort` and `uniq` to be sure the options I suggest are suitable for your intent. Plus, I have a feeling you really should show us a sample from the real file, and explain a little more...

grail 07-25-2010 07:55 AM

awk could also handle this easily. As the post above says, seeing some input would help. Also, what have you tried?

SilversleevesX 07-25-2010 08:35 AM

I've tried nothing as yet. I'm looking for "the best way to go about it" rather than follow the same progression as I did for the "zip-on-the-fly" one.

Here are a few snippets stitched together from the list file:
Code:

4q2-11084.jpg,4q2
4q2-11094.jpg,4q2
4q2-11106.jpg,4q2
4q2-11108.jpg,4q2
4q2-11121.jpg,4q2
4q2-13020.jpg,4q2
4q2-13145.jpg,4q2
4q2-14096.jpg,4q2
New Eri Sakurai Desktop.jpg,bztsilversleeves
New Eri Sakurai Wallpaper.jpg,bztsilversleeves
newlinuxdt.jpg,bztsilversleeves
amatbikini-girls0441-05.jpg,CandidBeach.com
beach-candids-005-003.jpg,CandidBeach.com
beach-candids-005-006.jpg,CandidBeach.com
10kylie01.jpg,Dixiecuties
10rosalie02.jpg,Dixiecuties
Homegrown0038-048.jpg,EagleOne
Homegrown0038-067.jpg,EagleOne
Homegrown0038-068.jpg,EagleOne
Homegrown0038-069.jpg,EagleOne
Homegrown0038-070.jpg,EagleOne
Homegrown0038-078.jpg,EagleOne
Homegrown0038-080.jpg,EagleOne
Homegrown0039-037.jpg,EagleOne
Homegrown0039-038.jpg,EagleOne
Homegrown0039-039.jpg,EagleOne
Homegrown0039-040.jpg,EagleOne
Homegrown0039-041.jpg,EagleOne
Homegrown0039-094.jpg,EagleOne
Homegrown0041-046.jpg,EagleOne
Homegrown0041-047.jpg,EagleOne
Homegrown0041-048.jpg,EagleOne
Homegrown0041-049.jpg,EagleOne
Homegrown0041-051.jpg,EagleOne
Homegrown0041-052.jpg,EagleOne
Homegrown0041-053.jpg,EagleOne
Homegrown0041-138.jpg,EagleOne
Homegrown0042-004.jpg,EagleOne
Homegrown0042-111.jpg,EagleOne
Homegrown0042-112.jpg,EagleOne
fg-0125-18-003.jpg,FynfuTnat.com
fg-0128-36-001.jpg,FynfuTnat.com
fynfutat-zneqvtenf011.jpg,FynfuTnat.com
fynfutat-zneqvtenf012.jpg,FynfuTnat.com
fynfutat-zneqvtenf015.jpg,FynfuTnat.com

So, with a glance or two at this list (which is much shorter, as I mentioned in my OP, than the one I mean for the script to work its way down, but still gives a good sense of how many lines there can be between one unique identifier in the second column and the next: arbitrary, to say the least, in both the number and length of the lines), what would you recommend: a sort/uniq combo, or awk?

BZT

ghostdog74 07-25-2010 09:51 AM

Code:

awk -F"," '{a[$2]=a[$2]" "$1}END{for(i in a)print i,a[i]}' file

Telemachos 07-25-2010 10:02 AM

This is in Perl, not Bash, but it gives you an idea of the logic:

Code:

#!/usr/bin/env perl
use strict;
use warnings;

my $prev_id = '';

while (<>) {
    chomp;
    my($photo, $id) = split ',';
    if ($id ne $prev_id) {
        print "$_\n";
        $prev_id = $id;
    }
}

You maintain state with $prev_id. Every time it changes, you print the whole line and change $prev_id to the new $id. Otherwise, you keep going.
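
For comparison, the same keep-one-state-variable logic fits in an awk one-liner as well. This is a sketch, not tested against your real file; the /tmp path and the sample lines (lifted from the snippets earlier in the thread) are just for illustration:

```shell
# Hypothetical miniature of the list file, for illustration only.
cat > /tmp/idlist.csv <<'EOF'
4q2-11084.jpg,4q2
4q2-11094.jpg,4q2
newlinuxdt.jpg,bztsilversleeves
10kylie01.jpg,Dixiecuties
10rosalie02.jpg,Dixiecuties
EOF

# Print each line whose second comma-separated field differs from the
# previous line's -- i.e. the first line of every identifier run, the
# same state-and-compare logic as the Perl above.
awk -F, '$2 != prev { print; prev = $2 }' /tmp/idlist.csv
```

Like the Perl, this only catches changes between consecutive lines, so it assumes the list is already grouped by identifier.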

SilversleevesX 07-25-2010 06:15 PM

Quote:

Originally Posted by ghostdog74 (Post 4044774)
Code:

awk -F"," '{a[$2]=a[$2]" "$1}END{for(i in a)print i,a[i]}' file

I tried this on a text file that was just the patched-together snippets of the big file (see previous post). I like what I see: the credit-column names come first, and then all the file names matching that credit string that differ at least a little bit from one another. Just now I tried it on a copy-and-paste of the lines out of the master list corresponding to the one source for which I have the most files (and the most changes in filename pattern). I had bash dump the awk results to a text file, and when it was finished I opened the file and found mostly files with only very minor differences in name between them.

I'm wondering if there is any way we could fine-tune the output so it doesn't include such subtle changes as a "gae39-3926-406-009.jpg" to a "gae39-3926-409-003.jpg". If, in order to do so, some parts of the big original list of filenames and credits/sources have to be broken down by 'major provider,' then I'm not averse to approaching it that way. This one source garnered 6,226 out of 28,000+ lines in the master file.

BZT

grail 07-25-2010 08:29 PM

I think you will need to provide more information.

For example, what you consider to be "subtle changes" seems to be a personal preference, which is fine, but probably only you can then tell what they are.

I mean, you could tell awk to just look at the first 15 characters (for example), but if this constantly changes depending on the length of the string, it will obviously become more complex.

Plus, it appears you have also now changed the parameters of this query, as initially you only had a 6000-line file which is now over 28000.

SilversleevesX 07-26-2010 12:52 AM

Quote:

Originally Posted by grail (Post 4045232)
I think you will need to provide more information.

For example, what you consider to be "subtle changes" seems to be a personal preference, which is fine, but probably only you can then tell what they are.

I mean, you could tell awk to just look at the first 15 characters (for example), but if this constantly changes depending on the length of the string, it will obviously become more complex.

Plus, it appears you have also now changed the parameters of this query, as initially you only had a 6000-line file which is now over 28000.

I thought of all of the above in the intervening hours. I think I may have got my wires crossed when I made the OP to this thread: 6000 was the number of files I was having copied to an SD card, via a reader and iView Media Pro, from a catalog that just happened to comprise the pics from that single most numerous source. The list of just those files, while it does have quite a few names that would preferably be more finely distinguished from one another (like the example I gave in my previous post), also has more than a few totally unique ones: some just numbers and a file extension, some pretty explicitly descriptive of the content of the pics. I hinted at breaking down the list of 28K+ names, but even the list of 6K+ would likewise have to be broken down to isolate the very similar filenames from the not-at-all-alike ones. I think I made mention of this "formidable obstacle" in my blog post regarding this project.

So let's hang the fine distinctions and get back to checking whether or not the awk command ghostdog74 suggested does indeed return the first of all the changes. I think it does, but I'll have to double-check.

What I pictured a script doing to find and match "first uniques" was to print a text file formatted similarly to the input file: an A column of sources and a B column with the first file found that matched that 'source' string. No big deal, though: I edited and sorted the 6K+ list using OpenOffice Calc; I could easily do the same with any shorter list.

If awk came across, say, an "Adult Empire sites" second-column entry with a file name like dylma1-158-01.jpg, and the next distinctively different name matching that source string was gae-015-002.jpg, then the way it grouped the filenames under the "heading" of "Adult Empire sites" would make it easy to recognize that dylma and gae were at the 'heads' of filenames whose files carried the IPTC Credit tag "Adult Empire sites". Therefore, when I made up the list of 'heads' for Exiv2 to use in writing Credit tags to new files, if it found a dylma1, -2 or -3, or a gae, it would know which Credit string to use.
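
(To sketch where this could eventually go: something like the following, where the case patterns, the dry-run echo, and the exiv2 -M syntax are all my assumptions rather than tested code; check `man exiv2` first.)

```shell
#!/bin/sh
# Hypothetical sketch: map a filename 'head' to its IPTC Credit string.
credit_for() {
    case "$1" in
        dylma[123]*|gae*) echo "Adult Empire sites" ;;
        gt*)              echo "Getty Images" ;;  # example from the OP
        *)                echo "" ;;              # unknown head
    esac
}

# Dry run: print the exiv2 command instead of executing it.
for f in dylma1-158-01.jpg gae-015-002.jpg; do
    credit=$(credit_for "$f")
    [ -n "$credit" ] &&
        echo exiv2 -M"set Iptc.Application2.Credit $credit" "$f"
done
```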

But getting back to this stage in the process:
At this point, I'd like to acknowledge all the help so far given me with this task. It might just be that involving bash and the command line to do any more than format lists I can afterward edit by hand is akin to keeping the nurse around after Grandma's recovered from her flu. If so, then I'll stop back to this thread when I come across something else I'm either vaguely or positively aware bash can help with in a "powerful" way unique to itself.

BZT

ghostdog74 07-26-2010 01:25 AM

It's better to show your expected output, given the input file you provided, and explain clearly along with the data you provide, instead of writing long essays like that. For me, when I look at long essays like that, I skip.

grail 07-26-2010 04:28 AM

Well, I could be wrong, and I do agree with ghostdog that sometimes too long an explanation can kill the enthusiasm to look at a query, but it appears that ghostdog's suggestion might do it if you then pare each name down to the text before the first hyphen, maybe?

SilversleevesX 07-27-2010 05:30 AM

Quote:

Originally Posted by grail (Post 4045232)
Plus, it appears you have also now changed the parameters of this query, as initially you only had a 6000-line file which is now over 28000.

Just call me the quasi-n00b who, like the "Man" in that Irish-made movie, "went up a hill and came down a mountain." :)

BZT

SilversleevesX 07-27-2010 05:57 AM

Quote:

Originally Posted by ghostdog74 (Post 4045377)
It's better to show your expected output, given the input file you provided, and explain clearly along with the data you provide, instead of writing long essays like that. For me, when I look at long essays like that, I skip.

Okay, here it all is in a nutshell.

A line in the input file looks like this:
Code:

gae41-4122-020-017.jpg,Adult Empire sites
The very next line in that same input file reads:
Code:

gae41-4127-014-210.jpg,Adult Empire sites
How I'd like the new list generated by the script is for these two lines (and others like them) to read:
Code:

gae41:Adult Empire sites
That way I can write a script that tells Exiv2 that when it comes across any new file starting with gae41, or dylma1, or myfpa1 (all Adult Empire "heads" of file names), to write an IPTC Credit tag to the new file that reads "Adult Empire sites" (the string in the B column, as GrapefruiTgirl kinda-sorta called it). A "smart list" of B-column sources would have only each change between, for instance, a set of gae50- files and another of gae51- files.
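
A first pass at producing exactly that head:credit format might look like the sketch below. It assumes the head always ends at the first hyphen, which does not hold across the whole 28000+ list; the /tmp file is just the two sample lines above:

```shell
#!/bin/sh
# The two sample lines from this post (hypothetical temp file).
cat > /tmp/credits.csv <<'EOF'
gae41-4122-020-017.jpg,Adult Empire sites
gae41-4127-014-210.jpg,Adult Empire sites
EOF

# Take everything before the first hyphen as the 'head', and print each
# head:credit pair only the first time it is seen.
awk -F, '{
    split($1, part, "-")
    key = part[1] ":" $2
    if (!seen[key]++) print key
}' /tmp/credits.csv
```

Both sample lines collapse to the single line gae41:Adult Empire sites.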

Just FYI: doing a
Code:

cat empire | grep foo | wc -l
with three different "foo" substitutions, I discovered just now that there are 5921 files with names that start with the three letters "gae", 176 that start with some variation of "myfpa", but only 14 that start with plays on "dylma". This means there are 6111 files with this IPTC Credit tag that follow some kind of name pattern; the rest of the file names are either all numbers, some arbitrary letter-number combination, or a string of letters/words.
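
Incidentally, those three counts can be had in one pass over the file instead of three cat|grep|wc pipelines; a sketch, with a hypothetical miniature of the 'empire' file standing in for the real one:

```shell
#!/bin/sh
# Hypothetical miniature of the 'empire' list, for illustration only.
cat > /tmp/empire <<'EOF'
gae41-4122-020-017.jpg,Adult Empire sites
gae41-4127-014-210.jpg,Adult Empire sites
dylma1-158-01.jpg,Adult Empire sites
myfpa1-001.jpg,Adult Empire sites
EOF

# Tally all three name patterns in a single awk pass.
awk '/^gae/   { g++ }
     /^myfpa/ { m++ }
     /^dylma/ { d++ }
     END { print "gae:", g+0, "myfpa:", m+0, "dylma:", d+0 }' /tmp/empire
```

For a single pattern, grep -c '^gae' empire would also replace each cat|grep|wc pipeline.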

It looks as though grail's suggestion, paring down to a hyphen, would work for the Adult Empire sites "sub-list" (6000 files). For a script meant to tackle the list of the whole 28000+, where not every change is signaled by a hyphen, it almost definitely would not work. Now I think I should forget about such a script, split the big list into smaller ones (with alpha ranges like #-#, a-f, g-h, i-n and o-z by filename instead of source) and puzzle out common patterns to apply to them in their own scripts.
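
That split can itself be scripted rather than done by hand; a rough sketch (the ranges here are simplified from the ones mentioned above, and the /tmp sample stands in for the real master list):

```shell
#!/bin/sh
# Hypothetical sample of the master list, for illustration only.
cat > /tmp/master.csv <<'EOF'
10kylie01.jpg,Dixiecuties
beach-candids-005-003.jpg,CandidBeach.com
gae41-4122-020-017.jpg,Adult Empire sites
newlinuxdt.jpg,bztsilversleeves
EOF

# Split the master list into sub-lists by the first character of the
# filename: digits, a-f, g-n, o-z (case-insensitive).
grep -i '^[0-9]' /tmp/master.csv > /tmp/list_num.csv
grep -i '^[a-f]' /tmp/master.csv > /tmp/list_af.csv
grep -i '^[g-n]' /tmp/master.csv > /tmp/list_gn.csv
grep -i '^[o-z]' /tmp/master.csv > /tmp/list_oz.csv
wc -l /tmp/list_*.csv
```

Each sub-list can then get its own pattern-matching script.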

BZT

grail 07-27-2010 06:39 AM

Quote:

For a script meant to tackle the list of the whole 28000+, where not every change is signaled by a hyphen
Well, this would depend on how many differing items there are (ie other than hyphen).
awk is very powerful, as displayed by ghostdog, and has many other options for manipulation. Also, there are the Perlites who will tell you of extraordinary things they can do with string manipulation (I am still learning :) )

One thought I have had, loosely based on ghostdog's script, which may help to identify a smaller list to work with, is to get one item from every column B, so as to pare down the possible list of delimiters (eg we already have hyphen as, hopefully, the delimiter to use for Adult Empire).

So try the following and see how many lines are returned:
Code:

awk -F, '!a[$2]++{b++}END{print b}' file
This will tell you how many unique column B's you have.

Assuming not 1000's you can then print the unique lines with the following into another file:
Code:

awk -F, '!a[$2]++' in_file > out_file
You do not have to go down this road; I just thought I would give you an option :)

SilversleevesX 07-27-2010 06:56 AM

Quote:

Originally Posted by grail (Post 4046733)
One thought I have had, loosely based on ghostdog's script, which may help to identify a smaller list to work with is to get one item from every column B so as to pare down the possible list of delimiters (eg we already have hyphen as, hopefully, being the delimiter to use for Adult Empire)

I kind of have that already. iView MP/MS Expression Media 2 keeps its collated tags alphabetized in separate lists (UTF-8 with DOS line endings, but otherwise easily convertible text files). There's one for Credit that I've already made ASCII/Unix, and since you can save these in "profile" bundles, the Credits listed in that file for the profile under which I most often create catalogs correspond nearly 1:1 with the B-column strings in the 28000+-file list. Maybe counting and cross-referencing the B-column stuff against the Unix-ified version of the iView Credit list should be a next step?

The main reason I haven't involved that file as yet is that, as I described in my previous post with regard to the gae's, most but not all of them follow discernible file name patterns, while "Adult Empire sites" is listed only once in the Credit list. I have a vague idea that it's something like this for a few others in the Credit list. Altogether, it's a tough go trying to get all those "heads" into one "hat" outside of a GUI. :)

BZT

