BASH: Write only unique strings to text file (cat or while read question)
Here's what I have:
A text file, 6000 lines long, with file names and strings of different lengths delimited by commas: one file path to one "identifier" string per line. Here's what I want to do: write a script that, every time it comes across an identifier string different from the one, two, or however many that came before it, writes that identifier to another text file, along with the path string with which it was first encountered. That way, when the output file is viewed, one can tell at a glance, just to give an example, that gt7900-56262.jpg and gtxmas-a1001.jpg both come from Getty Images. Is this possible in BASH, or should I consider some more robust scripting language/environment? Or maybe a GUI app like OpenOffice Calc? BZT |
So, you've got a file with 2 columns of data, like:
Code:
key1 key2
Check the manpages for both `sort` and `uniq` to be sure the options I suggest are suitable for your intent. Plus, I have a feeling you really should show us a sample from the real file, and explain a little more... |
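As a sketch of the sort/uniq approach (the sample lines are illustrative, borrowed from the question; check your sort's manpage as suggested above): a stable sort on column 2 combined with `-u` keeps the first input line seen for each identifier string.

```shell
# -t, sets comma as the field delimiter; -k2,2 sorts on column 2 only;
# -s (stable) preserves input order among equal keys, so -u then keeps
# the *first* line encountered for each identifier.
printf '%s\n' \
  'gt7900-56262.jpg,Getty Images' \
  'gtxmas-a1001.jpg,Getty Images' \
  'dylma1-158-01.jpg,Adult Empire sites' |
sort -t, -k2,2 -su
# → dylma1-158-01.jpg,Adult Empire sites
# → gt7900-56262.jpg,Getty Images
```

Note the output comes back sorted by identifier, not in original file order; if order matters, the awk approaches later in the thread preserve it.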
awk could handle this easily as well. As the post above says, seeing some input would help. Plus, what have you tried?
|
I've tried nothing as yet. I'm looking for "the best way to go about it" rather than follow the same progression as I did for the "zip-on-the-fly" one.
Here are a few snippets stitched together from the list file: Code:
4q2-11084.jpg,4q2
BZT |
Code:
awk -F"," '{a[$2]=a[$2]" "$1}END{for(i in a)print i,a[i]}' file |
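To see what the one-liner above does, here is a small demo on sample lines taken from earlier in the thread: it accumulates every filename under its source string, then prints one line per source. Note that awk's `for (i in a)` visits sources in arbitrary order.

```shell
# a[$2] collects all column-1 filenames seen for each column-2 source;
# the END block prints each source followed by its filenames.
printf '%s\n' \
  'gt7900-56262.jpg,Getty Images' \
  'gtxmas-a1001.jpg,Getty Images' \
  'dylma1-158-01.jpg,Adult Empire sites' |
awk -F"," '{a[$2]=a[$2]" "$1}END{for(i in a)print i,a[i]}'
# Prints (in arbitrary order) one line per source, e.g. the
# "Getty Images" line lists both gt7900-56262.jpg and gtxmas-a1001.jpg.
```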
This is in Perl, not Bash, but it gives you an idea of the logic:
Code:
#!/usr/bin/env perl |
Quote:
I'm wondering if there is any way we could fine-tune the output so it doesn't include such subtle changes as a "gae39-3926-406-009.jpg" to a "gae39-3926-409-003.jpg". If, in order to do so, some parts of the big original list of filenames and credits/sources have to be broken down by 'major provider,' then I'm not averse to approaching it that way. This one source garnered 6,226 out of 28,000+ lines in the master file. BZT |
I think you will need to provide more information?
For example, what you consider to be "subtle changes" seems to be a personal preference, which is fine, but probably only you can tell what they are. I mean, you could tell awk to just look at the first 15 characters (for example), but if that length constantly changes depending on the string, it will obviously become more complex. Plus, it appears you have also now changed the parameters of this query: initially you had a 6000-line file, which is now over 28,000. |
Quote:
So let's hang the fine distinctions and get back to checking whether or not the awk command ghostdog74 suggested does indeed return the first of all the changes. I think it does, but I'll have to double-check.

What I pictured a script doing to find and match "first uniques" was to print a text file similarly formatted to the input file: an A column of sources and a B column with the first file found that matched that 'source' string. No big deal, though: I edited and sorted the 6K+ list using OpenOffice Calc; I could easily do the same with any shorter list.

If awk came across, say, an "Adult Empire sites" second-column entry with a file name like dylma1-158-01.jpg, and the next distinctively different one that matched that source string was gae-015-002.jpg, then the way it passed the filenames under the "heading" of "Adult Empire sites" would make it easy to recognize that dylma and gae were at the 'heads' of filenames whose files carried the IPTC Credit tag "Adult Empire sites." That way, when I made up the list of 'heads' for EXIV2 to use when writing Credit tags to new files, if it found a dylma1, -2, or -3, or a gae, it would know which Credit string to use.

But getting back to this stage in the process: at this point, I'd like to acknowledge all the help so far given me with this task. It might just be that involving bash and the command line to do any more than format lists I can afterward edit by hand is akin to keeping the nurse around after Grandma's recovered from her flu. If so, then I'll stop back to this thread when I come across something else I'm either vaguely or positively aware bash can help with in a "powerful" way unique to itself. BZT |
It's better to show your expected output given the input file you provided, and explain clearly along with the data you provide, instead of writing long essays like that. For me, when I look at long essays like that, I skip.
|
Well, I could be wrong, and I do agree with ghostdog that sometimes too long an explanation can kill enthusiasm to look at a query, but it appears that
if you use ghostdog's suggestion and then maybe look at paring down to all text up to the first hyphen, maybe? |
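A minimal sketch of that first-hyphen idea (sample lines borrowed from the thread; the `seen` array name is my own): use everything before the first hyphen in the filename, together with the source string, as the dedup key, and keep only the first line seen for each key.

```shell
# split($1, p, "-") breaks the filename on hyphens; p[1] is the 'head'.
# !seen[key]++ is true only the first time a (head, source) pair appears,
# so awk's default action prints just that first line.
printf '%s\n' \
  'gae41-4122-020-017.jpg,Adult Empire sites' \
  'gae41-4127-014-210.jpg,Adult Empire sites' \
  'dylma1-158-01.jpg,Adult Empire sites' |
awk -F, '{ split($1, p, "-"); key = p[1] FS $2 } !seen[key]++'
# → gae41-4122-020-017.jpg,Adult Empire sites
# → dylma1-158-01.jpg,Adult Empire sites
```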
Quote:
BZT |
Quote:
A line in the input file looks like this: Code:
gae41-4122-020-017.jpg,Adult Empire sites
Code:
gae41-4127-014-210.jpg,Adult Empire sites
Code:
gae41:Adult Empire sites
Just FYI: doing a Code:
>> cat empire | grep foo | wc -l
It looks as though grail's suggestion, paring down to a hyphen, would work for the Adult Empire sites "sub-list" (6000 files). For a script meant to tackle the whole 28000+ list, where not every change is signaled by a hyphen, it almost definitely would not work. Now I think I should forget about such a script, split the big list into smaller ones (with alpha ranges like #-#, a-f, g-h, i-n and o-z by filename instead of source) and puzzle out common patterns to apply to them in their own scripts. BZT |
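The splitting-by-alpha-range step above can itself be done in one awk pass. This is only a hypothetical sketch: the output file names, the `master.csv` input name, and the exact ranges are illustrative, not from the thread.

```shell
# Route each line into a smaller list based on the first character
# of the filename (column 1), mirroring the #-#, a-f, g-h, i-n, o-z
# ranges mentioned above. awk keeps each output file open, so one
# pass over the master list suffices.
awk -F, '{
    c = tolower(substr($1, 1, 1))
    if      (c ~ /[0-9]/) out = "list_num.csv"
    else if (c ~ /[a-f]/) out = "list_a-f.csv"
    else if (c ~ /[g-h]/) out = "list_g-h.csv"
    else if (c ~ /[i-n]/) out = "list_i-n.csv"
    else                  out = "list_o-z.csv"
    print > out
}' master.csv
```

Each sub-list keeps the original `filename,source` format, so the same dedup one-liners can be run on it afterward.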
Quote:
awk is very powerful, as displayed by ghostdog, and has many other options for manipulation. Also, there are the perlites who will tell you of extraordinary things they can do with string manipulation (I am still learning :) )

One thought I have had, loosely based on ghostdog's script, which may help to identify a smaller list to work with, is to get one item from every column B so as to pare down the possible list of delimiters (e.g. we already have hyphen as, hopefully, the delimiter to use for Adult Empire).

So try the following and see how many lines are returned: Code:
awk -F, '!a[$2]++{b++}END{print b}' file
Assuming not 1000's, you can then print the unique lines into another file with the following: Code:
awk -F, '!a[$2]++' in_file > out_file |
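For anyone puzzled by the `!a[$2]++` idiom in the commands above, here is a small demo (sample lines borrowed from earlier posts): `a[$2]++` evaluates to the count seen so far for that source string, which is 0 the first time, so `!a[$2]++` is true only on a source's first appearance, and awk's default action prints that line, preserving input order.

```shell
# Keep only the first line for each column-2 source string.
printf '%s\n' \
  'gae41-4122-020-017.jpg,Adult Empire sites' \
  'gae41-4127-014-210.jpg,Adult Empire sites' \
  'gt7900-56262.jpg,Getty Images' |
awk -F, '!a[$2]++'
# → gae41-4122-020-017.jpg,Adult Empire sites
# → gt7900-56262.jpg,Getty Images
```

This is exactly the "first unique identifier plus the path it was first seen with" output the original question asked for, one line per source.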
Quote:
The main reason I haven't involved that file as yet is that, as I described in my previous post with regard to the gae's, most but not all of them follow discernible file name patterns, while "Adult Empire sites" is only listed once in the Credit list. I have a vague idea that it's something like this for a few others in the Credit list. Altogether, it's a tough go trying to get all those "heads" into one "hat" outside of a GUI. :) BZT |