[SOLVED] AWKward question: Splitting files, joining some other files to them and then saving the results ...
Hello everyperson!
I've been working on a few problems at my job but one has been exceptionally difficult for me to solve ... mostly due to the arcane rites required to make awk do something.
You can read that as "I have no clue about how awk syntax works" :P
So here's what I've got to get working ...
I have one csv file containing anywhere from 1 to 1 million userID numbers. I can ensure this file contains only one column if need be (and I'm pretty sure we're doing that now), so solutions can definitely rely on there being a single column.
I *don't* know beforehand how many records will be present, however.
What I need to do is convert the csv file to an xml file.
I've created a "header" and a "footer" which contain all the xml guff, ready to just dump the column of values in between - this much I have actually managed to get done ... but there's a problem ...
Our XML file cannot have more than 8,000 characters between any one pair of tags - it may actually be a limit on the entire file; consider that unconfirmed. I don't have any problem massaging a solution to fit those constraints, however, so we can take it as a hard limit on the number of characters between the tags (i.e. the userIDs sitting in-between the xml tags).
So, I actually need to do the following steps ...
1) strip the column header off the csv
2) replace all newlines with spaces
3) split the file into fragments having no more than 8,000 characters
4) check for and remove any comma at the end of each fragment
5) put the header and footer on each file fragment
6) save each fragment under a different name with an xml extension
I have used awk for steps 1 & 2 and cat for step 5, and when I realised I needed to chop up the file, I used split to break it up. Even knowing how to do those things ... I couldn't get any further.
I ran into problems removing the final comma, cat'ing each fragment into a different file, and so on ...
I would really like a solution which doesn't use intermediate files, though I can live with them; my (insufficient) current solution already uses them.
I'm fine with creating a shell script for the solution (current one is a script) but I'd prefer to stick to awk and/or other built-in tools as I do not have admin privileges on this machine.
Any questions, I'll be lurking all weekend ... any practical advice is welcome. At this point I think I'd have to learn awk as a language, but I'm too flat-out learning and coding in R (and C++) to have the brain space for that as well.
It's hard to understand what you are trying to do without seeing some sample data. For example, taking a csv file with column headings and replacing the newlines with spaces would seem to destroy the columnar nature of the csv file.
OK, so is this csv file really simple, or are some of the fields quoted strings perhaps with embedded spaces or commas, or even literal quote marks? The syntax of a csv file is not the easiest thing to parse, and there are variants, such as whether leading white space in a field is considered part of the field or is ignored. I've used awk to parse csv syntax, and while not terribly difficult, it certainly wasn't easy. I understand that perl has better support for csv files, but my knowledge of perl is approximately zero.
If it's a really simple csv file with no quoted strings or extraneous white space, then the task is easy and would just involve reading the file line by line, appending each element onto a new string with a comma (or other) separator, and then emitting the built-up string with appropriate header and trailer whenever adding another element would make the string too long. It would be easiest to write all of those output strings as lines in a single file, and then break up the file line-by-line later if necessary. That would save having to name and create multiple output files in awk.
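That build-up-and-flush idea can be sketched in a few lines of awk. The tiny limit, sample IDs and tag names below are all made up so that the flushing is actually visible:

```shell
# Append each ID onto a growing string; when the next ID (plus a comma)
# would push the string past the limit, emit it wrapped in tags and reset.
printf '101\n102\n103\n' |
awk -v limit=8 '
{
    if (length(out) + length($0) + 1 > limit) {
        print "<ids>" substr(out, 2) "</ids>"   # substr(,2) drops the leading comma
        out = ""
    }
    out = out "," $0
}
END { if (length(out)) print "<ids>" substr(out, 2) "</ids>" }'
# prints:
# <ids>101,102</ids>
# <ids>103</ids>
```

With a realistic limit of 7,900 or so, each printed line becomes one fragment, ready to be broken out into separate files afterwards.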
Quote:
Originally Posted by rknichols
OK, so is this csv file really simple, or are some of the fields quoted strings perhaps with embedded spaces or commas, or even literal quote marks? The syntax of a csv file is not the easiest thing to parse, and there are variants, such as whether leading white space in a field is considered part of the field or is ignored. I've used awk to parse csv syntax, and while not terribly difficult, it certainly wasn't easy. I understand that perl has better support for csv files, but my knowledge of perl is approximately zero.
The csv file is output directly from teradata - it's straight-up numbers and I can force it to be anything I want, including just 1 column of nothing but numbers and newlines; no leading spaces, no tabs, no strings.
Quote:
Originally Posted by rknichols
If it's a really simple csv file with no quoted strings or extraneous white space, then the task is easy and would just involve reading the file line by line, appending each element onto a new string with a comma separator, and then emitting the built-up string with appropriate header and trailer whenever adding another element would make the string too long. It would be easiest to write all of those output strings as lines in a single file, and then break up the file line-by-line later if necessary. That would save having to name and create multiple output files in awk.
The xml files must all contain the same header and footer data. The output must be several files, each with that same header and footer and up to 8,000 characters of data from the csv file.
These are structural constraints I have no control over ...
So the most efficient process I can see is to split the csv data first, then append the header/footer data to each fragment and save the fragments as separate files. These files are fed into another automated system which expects that format and can handle any number of them. Realistically we're unlikely to need more than 4 fragments, but I'd prefer the solution to handle n fragments so it can be reused for other automated tasks with higher volumes.
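The split-first process described above can be sketched in plain shell. The file names, tag text and sample data here are all made up, and GNU split is assumed for the -C and -d options:

```shell
# Hypothetical sketch: strip the csv header, split the one-ID-per-line body
# into pieces comfortably under 8,000 bytes (whole lines only), then wrap
# each piece in stand-in xml header/footer text.
printf 'userID\n12345\n23456\n34567\n' > ids.csv   # tiny made-up sample input

tail -n +2 ids.csv > ids.body            # strip the column header
split -C 7900 -d ids.body frag_          # GNU split: <7900 bytes, whole lines only
for f in frag_[0-9]*; do
    {
        printf '<ids>'                   # stand-in xml header
        paste -s -d, "$f" | tr -d '\n'   # join IDs with commas, drop the newline
        printf '</ids>\n'                # stand-in xml footer
    } > "$f.xml"
    rm "$f"
done
```

Because split breaks only at line boundaries, no userID is ever cut in half, and turning the remaining newlines into commas afterwards cannot push a piece over the limit.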
Last edited by yakkmeister; 03-10-2017 at 09:16 PM.
Reason: Realised previous comment needed more information
Here's a crude awk script that reads multiple lines with comma-separated fields and produces output lines that do not exceed 8000 characters with enclosing xml tags. Post processing the output to put each line into a separate file is trivial.
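The script itself isn't reproduced in the thread. Going only by the variable names discussed further down (limit, outline, Header, Footer) and the END line quoted later, it may have looked roughly like this sketch; treat every detail here as an assumption rather than the actual posted code:

```shell
# Hedged reconstruction -- the original "crude awk script" is not preserved
# here, so this sketch may well differ from what was actually posted.
printf 'userID\n12345\n23456\n34567\n' > ids.csv   # made-up sample input
awk -v limit=7900 '
BEGIN {
    Header = "<ids>"       # stand-ins for the real xml guff
    Footer = "</ids>"
}
NR == 1 { next }           # skip the csv column header
{
    # would appending this ID (plus its separating comma) exceed the limit?
    if (length(outline) + length($0) + 1 > limit) {
        print Header substr(outline, 2) Footer   # substr(,2) drops the leading comma
        outline = ""
    }
    outline = outline "," $0
}
END {
    if (length(outline))   # avoid emitting an empty, tags-only fragment
        print Header substr(outline, 2) Footer
}' ids.csv
```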
Damn - I was going to charge a steak at the Brekky Creek. Another free meal bites the dust ...
If the file hasn't been concatenated already, that is also easy to do in the same awk script. It might be worthwhile checking the length in the END block - I saw empty output in some of my tests, which gave just the xml tags.
Last edited by syg00; 03-10-2017 at 11:25 PM.
Reason: ignore that bit
Quote:
Originally Posted by syg00
Here's a crude awk script that reads multiple lines with comma-separated fields and produces output lines that do not exceed 8000 characters with enclosing xml tags. Post processing the output to put each line into a separate file is trivial.
Well, I think I just got a dose of "see how ignorant I am" ... lol!
Would you be so kind as to let me know what each bit actually does?
I think "limit" must be to check the maximum output (in characters?) against?
No idea what "outline" is doing ... is that the string containing the values from the csv file?
Does this save each fragment then to a different file?
I am legitimately lost with awk, so sorry if this is a pain ...
perl is good for these things, too, and has a lot of ready-made modules that take care of more details and edge cases.
About the explanation, go through the awk manual and the source above and look up the various functions. It is best to get comfortable looking through the manual, it is a good reference. It is THE reference, in fact.
However, the general characteristics maybe could use some description. The BEGIN { } clause is run just once as the script starts, before the incoming data is processed. Conversely, the END { } clause is run just once after the incoming data has run out.
The main clause(s) { } occur(s) once for each record, as defined by the pattern in RS, while the data lasts. There can be as many of those as you need, but here there is only one. Regular variables don't use a dollar sign. So the only tricky part I see there is n versus $n. The former is just a variable, the latter uses that variable to reference the field number contained in that variable. So if n=2 then $n means the same as $2, if n=4, then $n means the same as $4, and so on.
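A one-liner makes the n versus $n distinction concrete (the sample text is made up):

```shell
# n is an ordinary variable holding 2; $n is field number 2 of the record.
echo 'alpha beta gamma' | awk '{ n = 2; print n, $n }'
# prints: 2 beta
```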
Actually, I think that just about makes sense.
Which means it's sort of missing one major part of the problem ...
Quote:
Post processing the output to put each line into a separate file ...
Which, I hear tell, is trivial ...
What I already had basically did something similar to the above (only 90% kludge, 10% she'll-be-right-mate), with the major problem of not being able to save multiple files with unique names. So this solution is definitely going to be tested on my data come Monday ... but how do I tell it to create a new name for each 8,000-character fragment?
Code:
END { print Header substr(outline, 2) Footer }
The above doesn't actually write a file, unless I totally miss my guess ... I need to pipe this bad boy into something else ... ?
Quote:
Originally Posted by yakkmeister
I think "limit" must be to check the maximum output (in characters?) against?
No idea what "outline" is doing ... is that the string containing the values from the csv file?
Does this save each fragment then to a different file?
Yes, limit is just a sensible number rather than trying to use all 8,000 characters - it saves complex calculations.
outline is just that - a concatenation of records/fields, built up until it reaches the limit, at which point it is printed.
No, it prints to stdout - that is usually redirected to a file. You could postprocess that into different fragments, or the print statements (in the awk above) can be directed to file(s) using an incrementing variable.
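That second option - directing each print to a file named with an incrementing variable - can be sketched like this (the fragment names are made up):

```shell
# awk treats the parenthesised expression as a filename, so each record
# lands in its own numbered file: fragment1.xml, fragment2.xml, ...
printf 'a\nb\nc\n' | awk '{ n++; print $0 > ("fragment" n ".xml") }'
```

In the fragment-splitting script above, the same redirection would replace the plain print, bumping n each time a fragment is flushed.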
Quote:
Originally Posted by syg00
You could postprocess that to different fragments, or the print statements (in the awk above) can be directed to file(s) using an incrementing variable.
I believe you! I don't know how to do that ...
You say the print statement is the place to do it. How about something like ...