LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 03-10-2017, 07:47 PM   #1
yakkmeister
Member
 
Registered: Jan 2005
Location: QLD Australia
Distribution: Debian 10 'Buster'
Posts: 78

Rep: Reputation: 16
Post AWKward question: Splitting files, joining some other files to them and then saving the results ...


Hello everyperson!

I've been working on a few problems at my job but one has been exceptionally difficult for me to solve ... mostly due to the arcane rites required to make awk do something.
You can read that as "has no clue about how awk syntax works" :P

So here's what I've got to get working ...

I have one csv file with anything from 1 to 1 million userID numbers. I can ensure this file contains only one column if need be (and I'm pretty sure we're doing that now ...), so I can be certain of the column; solutions can definitely make use of that.
I *don't* know how many records will be present beforehand, however.

What I need to do is convert the csv file to an xml file.

I've created a "header" and a "footer" which contain all the xml guff, ready to just dump the column of values in between - this much I have actually managed to get done ... but there's a problem ...

Our XML file cannot have more than 8,000 characters between any one pair of tags - it may actually be a limit on the entire file ... consider that unconfirmed. I don't have any problem massaging a solution to fit those constraints, however, so we can take it as a hard limit on the number of characters between the tags (this being the userIDs in between the xml tags).

So, I actually need to do the following steps ...

1) strip the column header off the csv
2) replace all newlines with spaces
3) split the file into fragments having no more than 8,000 characters
4) check for and remove any comma at the end of each fragment
5) put the header and footer on each file fragment
6) save each fragment with a different name having an xml extension

I have used awk for the first two steps and cat for adding the header and footer, and, when I realised I needed to chop up the file, I used split to break it up. Beyond those, I couldn't get any further.
I ran into problems removing the final comma, cat'ing each fragment out as a different file, and so on ...

I would really like a solution which doesn't use intermediate files, though I can live with them; my (insufficient) current solution already uses them.
I'm fine with creating a shell script for the solution (the current one is a script), but I'd prefer to stick to awk and/or other built-in tools, as I do not have admin privileges on this machine.

Any questions, I'll be lurking all weekend ... any practical advice is welcome. At this point I think I'd have to learn awk properly as a language, but I'm too flat-out learning and coding in R (and C++) to have the brain space for that as well.

Thanks for any help
 
Old 03-10-2017, 08:32 PM   #2
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,779

Rep: Reputation: 2212
It's hard to understand what you are trying to do without seeing some sample data. For example, taking a csv file with column headings and replacing the newlines with spaces would seem to destroy the columnar nature of the csv file.
 
Old 03-10-2017, 08:39 PM   #3
yakkmeister
Member
 
Registered: Jan 2005
Location: QLD Australia
Distribution: Debian 10 'Buster'
Posts: 78

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by rknichols View Post
It's hard to understand what you are trying to do without seeing some sample data. For example, taking a csv file with column headings and replacing the newlines with spaces would seem to destroy the columnar nature of the csv file.
It does. That's the intended behaviour.


Sample csv data could be anything like:

Code:
internal_ID
123
456
789
...
999
sample output could be like:

output_1.xml
Code:
<xml stuff>123, 456</xml stuff>
output_2.xml
Code:
<xml stuff>789, ..., 999</xml stuff>
 
Old 03-10-2017, 09:02 PM   #4
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,779

Rep: Reputation: 2212
OK, so is this csv file really simple, or are some of the fields quoted strings perhaps with embedded spaces or commas, or even literal quote marks? The syntax of a csv file is not the easiest thing to parse, and there are variants, such as whether leading white space in a field is considered part of the field or is ignored. I've used awk to parse csv syntax, and while not terribly difficult, it certainly wasn't easy. I understand that perl has better support for csv files, but my knowledge of perl is approximately zero.

If it's a really simple csv file with no quoted strings or extraneous white space, then the task is easy and would just involve reading the file line by line, appending each element onto a new string with a comma (or other) separator, and then emitting the built-up string with appropriate header and trailer whenever adding another element would make the string too long. It would be easiest to write all of those output strings as lines in a single file, and then break up the file line-by-line later if necessary. That would save having to name and create multiple output files in awk.

Last edited by rknichols; 03-10-2017 at 09:04 PM.
 
Old 03-10-2017, 09:06 PM   #5
yakkmeister
Member
 
Registered: Jan 2005
Location: QLD Australia
Distribution: Debian 10 'Buster'
Posts: 78

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by rknichols View Post
OK, so is this csv file really simple, or are some of the fields quoted strings perhaps with embedded spaces or commas, or even literal quote marks? The syntax of a csv file is not the easiest thing to parse, and there are variants, such as whether leading white space in a field is considered part of the field or is ignored. I've used awk to parse csv syntax, and while not terribly difficult, it certainly wasn't easy. I understand that perl has better support for csv files, but my knowledge of perl is approximately zero.
The csv file is output directly from teradata - it's straight-up numbers and I can force it to be anything I want, including just 1 column of nothing but numbers and newlines; no leading spaces, no tabs, no strings.

Quote:
Originally Posted by rknichols View Post
If it's a really simple csv file with no quoted strings or extraneous white space, then the task is easy and would just involve reading the file line by line, appending each element onto a new string with a comma separator, and then emitting the built-up string with appropriate header and trailer whenever adding another element would make the string too long. It would be easiest to write all of those output strings as lines in a single file, and then break up the file line-by-line later if necessary. That would save having to name and create multiple output files in awk.
The xml files must all contain the same header and footer data. The output must be several files, each with that same header and footer and up to 8,000 characters of data from the csv file.
These are structural constraints I have no control over ...
So the most efficient process I can see is to split the csv data first, then append the header/footer data to each fragment, saved as separate files. These files are fed into another automated system which expects that to be the case and can handle any number of these files. Realistically, we're unlikely to need more than 4 such fragments, but I'd prefer this to handle n fragments so it can be reused for other automated tasks with higher volumes.

Last edited by yakkmeister; 03-10-2017 at 09:16 PM. Reason: Realised previous comment needed more information
 
Old 03-10-2017, 11:07 PM   #6
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,779

Rep: Reputation: 2212
Here's a crude awk script that reads multiple lines with comma-separated fields and produces output lines that do not exceed 8000 characters with enclosing xml tags. Post processing the output to put each line into a separate file is trivial.
Code:
BEGIN {
      FS = ","
      Header = "<xml stuff>"
      Footer = "</xml stuff>"
      Limit = 7900
}
{
    for(n = 1; n <= NF; ++n) {
	if(length(outline) + length($n) < Limit)  outline = outline "," $n
	else {
	    print Header substr(outline, 2) Footer
	    outline = "," $n
	}
    }
}
END { print Header substr(outline, 2) Footer }
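Since the script above writes everything to stdout, the "trivial" post-processing step can be sketched with split. This assumes GNU split (for -d and --additional-suffix); the file names are made up for illustration:

```shell
# One fragment per line; turn each line into its own numbered .xml file.
# Assumes GNU coreutils split; "combined.txt" and "output_" are hypothetical names.
printf '%s\n' '<xml stuff>123,456</xml stuff>' '<xml stuff>789,999</xml stuff>' > combined.txt
split -l 1 -d --additional-suffix=.xml combined.txt output_
# Produces output_00.xml and output_01.xml, one fragment each.
```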
 
2 members found this post helpful.
Old 03-10-2017, 11:22 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Damn - I was going to charge a steak at the Brekky Creek. Another free meal bites the dust ...

If the file hasn't been concatenated already, that is also easy to do in the same awk script. It might be worthwhile checking the length in the END block - I saw an empty string in some of my tests, which gave just the xml tags.

Last edited by syg00; 03-10-2017 at 11:25 PM. Reason: ignore that bit
 
Old 03-10-2017, 11:31 PM   #8
yakkmeister
Member
 
Registered: Jan 2005
Location: QLD Australia
Distribution: Debian 10 'Buster'
Posts: 78

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by rknichols View Post
Here's a crude awk script that reads multiple lines with comma-separated fields and produces output lines that do not exceed 8000 characters with enclosing xml tags. Post processing the output to put each line into a separate file is trivial.
Code:
BEGIN {
      FS = ","
      Header = "<xml stuff>"
      Footer = "</xml stuff>"
      Limit = 7900
}
{
    for(n = 1; n <= NF; ++n) {
	if(length(outline) + length($n) < Limit)  outline = outline "," $n
	else {
	    print Header substr(outline, 2) Footer
	    outline = "," $n
	}
    }
}
END { print Header substr(outline, 2) Footer }
Well, I think I just got a dose of "see how ignorant I am" ... lol!
Would you be so kind as to let me know what each bit actually does?

I think "Limit" must be what the maximum output (in characters?) is checked against?
No idea what "outline" is doing ... is that the string accumulating the values from the csv file?
Does this then save each fragment to a different file?

I am legitimately lost with awk, so sorry if this is a pain ...
 
Old 03-11-2017, 12:23 AM   #9
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
perl is good for these things, too, and has a lot of ready-made modules that take care of more details and edge cases.

About the explanation: go through the awk manual and the source above and look up the various functions. It is best to get comfortable looking through the manual; it is a good reference. It is THE reference, in fact.

However, the general characteristics could maybe use some description. The BEGIN { } clause is run just once as the script starts, before the incoming data is processed. Conversely, the END { } clause is run just once after the incoming data has run out.

The main clause(s) { } run once for each record, as defined by the pattern in RS, for as long as the data lasts. There can be as many of those as you need, but here there is only one. Regular variables don't use a dollar sign, so the only tricky part I see there is n versus $n. The former is just a variable; the latter uses that variable to reference the field whose number the variable holds. So if n=2 then $n means the same as $2, if n=4 then $n means the same as $4, and so on.
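A one-liner makes the n versus $n distinction concrete (the input string here is just made up):

```shell
# n is a plain variable holding 2; $n is field number 2 of the record.
echo "aa,bb,cc" | awk -F, '{ n = 2; print n, $n }'
# prints: 2 bb
```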
 
1 member found this post helpful.
Old 03-11-2017, 12:38 AM   #10
yakkmeister
Member
 
Registered: Jan 2005
Location: QLD Australia
Distribution: Debian 10 'Buster'
Posts: 78

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by Turbocapitalist View Post
perl is good for these things, too, and has a lot of ready-made modules that take care of more details and edge cases.

About the explanation: go through the awk manual and the source above and look up the various functions. It is best to get comfortable looking through the manual; it is a good reference. It is THE reference, in fact.

However, the general characteristics could maybe use some description. The BEGIN { } clause is run just once as the script starts, before the incoming data is processed. Conversely, the END { } clause is run just once after the incoming data has run out.

The main clause(s) { } run once for each record, as defined by the pattern in RS, for as long as the data lasts. There can be as many of those as you need, but here there is only one. Regular variables don't use a dollar sign, so the only tricky part I see there is n versus $n. The former is just a variable; the latter uses that variable to reference the field whose number the variable holds. So if n=2 then $n means the same as $2, if n=4 then $n means the same as $4, and so on.
Actually, I think that just about makes sense.
Which means it's sort of missing one major part of the problem ...

Quote:
Post processing the output to put each line into a separate file ...
Which, I hear tell, is trivial ...
What I already had basically did something similar to the above (only with 90% kludge, 10% she'll-be-right-mate) with the major problem of not being able to save multiple files with unique names. So, this solution is definitely going to be tested on my data come Monday ... but how do I tell it to create a new name for each > 8,000 character fragment?

Code:
END { print Header substr(outline, 2) Footer }
The above doesn't actually write a file, unless I totally miss my guess ... I need to pipe this bad boy into something else ... ?
 
Old 03-11-2017, 12:41 AM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
I'm curious as to where you found this implied limit of 8,000? I don't seem to be able to verify that there is any actual limit.

Also, the single-column file is easily parsed in bash:
Code:
#!/usr/bin/env bash

head="<xml stuff>"
foot="</xml stuff>"
limit=7900
file_name=output_
count=0

while read -r line
do
  (( count == 0 )) && { (( count++ )); continue; }

  if (( ${#outline} + ${#line} < limit ))
  then
    outline="$outline,$line"
  else
    echo "$head${outline#,}$foot" > "$file_name$count.xml"
    outline=",$line"
    (( count++ ))
  fi
done < infile

echo "$head${outline#,}$foot" > "$file_name$count.xml"

Last edited by grail; 03-11-2017 at 12:48 AM. Reason: Forgot about header and separate files
 
1 member found this post helpful.
Old 03-11-2017, 12:42 AM   #12
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Quote:
Originally Posted by yakkmeister View Post
I think "Limit" must be what the maximum output (in characters?) is checked against?
No idea what "outline" is doing ... is that the string accumulating the values from the csv file?
Does this then save each fragment to a different file?
Yes, Limit is just a sensible number rather than trying to use all 8,000 characters - it saves complex calculations.
outline is just that - a concatenation of records/fields up to the limit, at which point it is printed.
No, it prints to stdout - which is usually redirected to a file. You could postprocess that into different fragments, or the print statements (in the awk above) can be directed to file(s) using an incrementing variable.
 
1 member found this post helpful.
Old 03-11-2017, 12:50 AM   #13
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
Quote:
Originally Posted by yakkmeister View Post
The above doesn't actually write a file, unless I totally miss my guess ... I need to pipe this bad boy into something else ... ?
You can redirect within awk itself. You can concatenate variables and strings, too. The following puts each line in a separate, numbered file:

Code:
awk '{ i++; print $2, $3 > (i ".txt") }'
 
1 member found this post helpful.
Old 03-11-2017, 12:55 AM   #14
yakkmeister
Member
 
Registered: Jan 2005
Location: QLD Australia
Distribution: Debian 10 'Buster'
Posts: 78

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by syg00 View Post
You could postprocess that to different fragments, or the print statements (in the awk above) can be directed to file(s) using an incrementing variable.
I believe you! I don't know how to do that ...
You say the print statement is the place to do it. How about something like ...

Code:
BEGIN {
      FS = ","
      Header = "<xml stuff>"
      Footer = "</xml stuff>"
      Limit = 7900
      Fragment = 1
}
{
    for(n = 1; n <= NF; ++n) {
	if(length(outline) + length($n) < Limit)  outline = outline "," $n
	else {
	    print Header substr(outline, 2) Footer
	    outline = "," $n
	}
    }
}
END { print Header substr(outline, 2) Footer > outputFile_"${Fragment}".xml}
Is that going to actually create a new file each time the limit is met ... or make 1 file of <= limit then stop?

Oh; to invoke this puppy ...
Code:
>./script_we_made.sh the_csv_file.csv
or more like ...
Code:
>cat the_csv_file.csv > script_we_made.sh

Last edited by yakkmeister; 03-11-2017 at 12:59 AM. Reason: Forgot something
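For reference, the usual invocation passes the csv file as an argument to awk (or to the wrapper script) rather than redirecting the csv into the script. A minimal sketch, with made-up file names and a stand-in one-line program:

```shell
# Create a throwaway input and program purely to show the invocation shape.
# "the_csv_file.csv" and "split_csv.awk" are hypothetical names.
printf 'internal_ID\n123\n456\n' > the_csv_file.csv
printf 'NR > 1 { print $1 }\n' > split_csv.awk   # stand-in program: skip header, echo IDs
awk -f split_csv.awk the_csv_file.csv            # input file as an argument, not via cat
```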
 
Old 03-11-2017, 01:04 AM   #15
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
You'll probably want to address this fragment instead:
Code:
	    print Header substr(outline, 2) Footer
The END { } part only gets run once, but it does get run.

Again, variables don't use $ normally.

Code:
	    print Header substr(outline, 2) Footer > (Fragment ".xml")
	    Fragment++

The function sprintf could tidy up the numbering: sprintf("%04d", Fragment).
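Putting the pieces of this thread together, a combined sketch might look like the following. The header/footer strings, the Limit value, and the output_%04d naming are assumptions carried over from the examples above, and the sample input is made up:

```shell
# Hypothetical sample input: one column, with a header line to strip.
printf 'internal_ID\n123\n456\n789\n' > the_csv_file.csv

awk 'BEGIN { Header = "<xml stuff>"; Footer = "</xml stuff>"
             Limit = 7900; Fragment = 1 }
     NR > 1 {                            # skip the csv column header
         if (length(outline) + length($1) < Limit)
             outline = outline "," $1    # accumulate IDs behind a leading comma
         else {
             print Header substr(outline, 2) Footer > (sprintf("output_%04d.xml", Fragment))
             Fragment++
             outline = "," $1
         }
     }
     END { if (length(outline))          # flush the final (possibly only) fragment
               print Header substr(outline, 2) Footer > (sprintf("output_%04d.xml", Fragment))
     }' the_csv_file.csv
# With this tiny input everything fits in one fragment: output_0001.xml
```

substr(outline, 2) drops the leading comma, which also takes care of the "remove the trailing comma" step from the original list.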
 
1 member found this post helpful.
  

