[SOLVED] AWKward question: Splitting files, joining some other files to them and then saving the results ...
Hello everyperson!
I've been working on a few problems at my job but one has been exceptionally difficult for me to solve ... mostly due to the arcane rites required to make awk do something.
You can read that as "I have no clue about how awk syntax works" :P
So here's what I've got to get working ...
I have one csv file containing anywhere from 1 to 1 million userID numbers. I can ensure this file contains only one column if need be (and I'm pretty sure we're doing that now), so solutions can definitely rely on there being a single column.
I *don't* know beforehand how many records will be present, however.
What I need to do is convert the csv file to an xml file.
I've created a "header" and a "footer" which contain all the xml guff, ready to just dump the column of values in between - this much I have actually managed to get done ... but there's a problem ...
Our XML file cannot have more than 8,000 characters between any one pair of tags - it may actually be a limit on the entire file; consider that unconfirmed. I don't have any problem massaging a solution to fit those constraints, however, so we can take it as a hard limit on the number of characters between the tags (i.e. the userIDs sitting in-between the xml tags).
So, I actually need to do the following steps ...
1) strip the column header off the csv
2) replace all newlines with spaces
3) split the file into fragments having no more than 8,000 characters
4) check for and remove any comma at the end of each fragment
5) put the header and footer on each file fragment
6) save each fragment under a different name with an xml extension
I have used awk for steps 1 & 2 and cat for step 5, and when I realised I needed to chop up the file, I used split to break it up. Even knowing how to do those things ... I couldn't get any further.
I ran into problems removing the final comma, cat'ing each fragment into a different file, and so on ...
I would really like a solution which doesn't use intermediate files, though I can live with them; my (insufficient) current solution already uses them.
I'm fine with creating a shell script for the solution (current one is a script) but I'd prefer to stick to awk and/or other built-in tools as I do not have admin privileges on this machine.
Any questions, I'll be lurking all weekend ... any practical advice is welcome. At this point I think I'd have to learn awk as a language, but I'm too flat-out learning and coding in R (and C++) to have the brain space for that as well.
It's hard to understand what you are trying to do without seeing some sample data. For example, taking a csv file with column headings and replacing the newlines with spaces would seem to destroy the columnar nature of the csv file.
OK, so is this csv file really simple, or are some of the fields quoted strings perhaps with embedded spaces or commas, or even literal quote marks? The syntax of a csv file is not the easiest thing to parse, and there are variants, such as whether leading white space in a field is considered part of the field or is ignored. I've used awk to parse csv syntax, and while not terribly difficult, it certainly wasn't easy. I understand that perl has better support for csv files, but my knowledge of perl is approximately zero.
If it's a really simple csv file with no quoted strings or extraneous white space, then the task is easy and would just involve reading the file line by line, appending each element onto a new string with a comma (or other) separator, and then emitting the built-up string with appropriate header and trailer whenever adding another element would make the string too long. It would be easiest to write all of those output strings as lines in a single file, and then break up the file line-by-line later if necessary. That would save having to name and create multiple output files in awk.
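That build-up-and-flush idea can be sketched in a few lines of awk. The tiny limit, sample IDs and tag names below are all made up so that the flushing is actually visible:

```shell
# Append each ID onto a growing string; when the next ID (plus a comma)
# would push the string past the limit, emit it wrapped in tags and reset.
printf '101\n102\n103\n' |
awk -v limit=8 '
{
    if (length(out) + length($0) + 1 > limit) {
        print "<ids>" substr(out, 2) "</ids>"   # substr(,2) drops the leading comma
        out = ""
    }
    out = out "," $0
}
END { if (length(out)) print "<ids>" substr(out, 2) "</ids>" }'
# prints:
# <ids>101,102</ids>
# <ids>103</ids>
```

With a realistic limit of 7,900 or so, each printed line becomes one fragment, ready to be broken out into separate files afterwards.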
Quote:
Originally Posted by rknichols
OK, so is this csv file really simple, or are some of the fields quoted strings perhaps with embedded spaces or commas, or even literal quote marks? The syntax of a csv file is not the easiest thing to parse, and there are variants, such as whether leading white space in a field is considered part of the field or is ignored. I've used awk to parse csv syntax, and while not terribly difficult, it certainly wasn't easy. I understand that perl has better support for csv files, but my knowledge of perl is approximately zero.
The csv file is output directly from teradata - it's straight-up numbers and I can force it to be anything I want, including just 1 column of nothing but numbers and newlines; no leading spaces, no tabs, no strings.
Quote:
Originally Posted by rknichols
If it's a really simple csv file with no quoted strings or extraneous white space, then the task is easy and would just involve reading the file line by line, appending each element onto a new string with a comma separator, and then emitting the built-up string with appropriate header and trailer whenever adding another element would make the string too long. It would be easiest to write all of those output strings as lines in a single file, and then break up the file line-by-line later if necessary. That would save having to name and create multiple output files in awk.
The xml files must all contain the same header and footer data. The output must be several files, each with that same header and footer and up to 8,000 characters of data from the csv file.
These are structural constraints I have no control over ...
So the most efficient process I can see is to split the csv data first, then append the header/footer data to each fragment and save the fragments as separate files. These files are fed into another automated system which expects that format and can handle any number of them. Realistically we're unlikely to need more than 4 fragments, but I'd prefer the solution to handle n fragments so it can be reused for other automated tasks with higher volumes.
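The split-first process described above can be sketched in plain shell. The file names, tag text and sample data here are all made up, and GNU split is assumed for the -C and -d options:

```shell
# Hypothetical sketch: strip the csv header, split the one-ID-per-line body
# into pieces comfortably under 8,000 bytes (whole lines only), then wrap
# each piece in stand-in xml header/footer text.
printf 'userID\n12345\n23456\n34567\n' > ids.csv   # tiny made-up sample input

tail -n +2 ids.csv > ids.body            # strip the column header
split -C 7900 -d ids.body frag_          # GNU split: <7900 bytes, whole lines only
for f in frag_[0-9]*; do
    {
        printf '<ids>'                   # stand-in xml header
        paste -s -d, "$f" | tr -d '\n'   # join IDs with commas, drop the newline
        printf '</ids>\n'                # stand-in xml footer
    } > "$f.xml"
    rm "$f"
done
```

Because split breaks only at line boundaries, no userID is ever cut in half, and turning the remaining newlines into commas afterwards cannot push a piece over the limit.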
Last edited by yakkmeister; 03-10-2017 at 09:16 PM.
Reason: Realised previous comment needed more information
Here's a crude awk script that reads multiple lines with comma-separated fields and produces output lines that do not exceed 8000 characters with enclosing xml tags. Post processing the output to put each line into a separate file is trivial.
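The script itself isn't reproduced in the thread. Going only by the variable names discussed further down (limit, outline, Header, Footer) and the END line quoted later, it may have looked roughly like this sketch; treat every detail here as an assumption rather than the actual posted code:

```shell
# Hedged reconstruction -- the original "crude awk script" is not preserved
# here, so this sketch may well differ from what was actually posted.
printf 'userID\n12345\n23456\n34567\n' > ids.csv   # made-up sample input
awk -v limit=7900 '
BEGIN {
    Header = "<ids>"       # stand-ins for the real xml guff
    Footer = "</ids>"
}
NR == 1 { next }           # skip the csv column header
{
    # would appending this ID (plus its separating comma) exceed the limit?
    if (length(outline) + length($0) + 1 > limit) {
        print Header substr(outline, 2) Footer   # substr(,2) drops the leading comma
        outline = ""
    }
    outline = outline "," $0
}
END {
    if (length(outline))   # avoid emitting an empty, tags-only fragment
        print Header substr(outline, 2) Footer
}' ids.csv
```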
Damn - I was going to charge a steak at the Brekky Creek. Another free meal bites the dust ...
If the file hasn't been concatenated already, that is also easy to do in the same awk script. It might be worthwhile checking the length in the END block - I saw empty output in some of my tests, which gave just the xml tags.
Last edited by syg00; 03-10-2017 at 11:25 PM.
Reason: ignore that bit
Quote:
Originally Posted by syg00
Here's a crude awk script that reads multiple lines with comma-separated fields and produces output lines that do not exceed 8000 characters with enclosing xml tags. Post processing the output to put each line into a separate file is trivial.
Well, I think I just got a dose of "see how ignorant I am" ... lol!
Would you be so kind as to let me know what each bit actually does?
I think "limit" must be to check the maximum output (in characters?) against?
No idea what "outline" is doing ... is that the string containing the values from the csv file?
Does this save each fragment then to a different file?
I am legitimately lost with awk, so sorry if this is a pain ...
perl is good for these things, too, and has a lot of ready-made modules that take care of more details and edge cases.
About the explanation, go through the awk manual and the source above and look up the various functions. It is best to get comfortable looking through the manual, it is a good reference. It is THE reference, in fact.
However, the general characteristics maybe could use some description. The BEGIN { } clause is run just once as the script starts, before the incoming data is processed. Conversely, the END { } clause is run just once after the incoming data has run out.
The main clause(s) { } occur(s) once for each record, as defined by the pattern in RS, while the data lasts. There can be as many of those as you need, but here there is only one. Regular variables don't use a dollar sign. So the only tricky part I see there is n versus $n. The former is just a variable, the latter uses that variable to reference the field number contained in that variable. So if n=2 then $n means the same as $2, if n=4, then $n means the same as $4, and so on.
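A one-liner makes the n versus $n distinction concrete (the sample text is made up):

```shell
# n is an ordinary variable holding 2; $n is field number 2 of the record.
echo 'alpha beta gamma' | awk '{ n = 2; print n, $n }'
# prints: 2 beta
```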
Actually, I think that just about makes sense.
Which means it's sort of missing one major part of the problem ...
Quote:
Post processing the output to put each line into a separate file ...
Which, I hear tell, is trivial ...
What I already had basically did something similar to the above (only 90% kludge, 10% she'll-be-right-mate), with the major problem of not being able to save multiple files with unique names. So this solution is definitely going to be tested on my data come Monday ... but how do I tell it to create a new name for each 8,000-character fragment?
Code:
END { print Header substr(outline, 2) Footer }
The above doesn't actually write a file, unless I totally miss my guess ... I need to pipe this bad boy into something else ... ?
Quote:
Originally Posted by yakkmeister
I think "limit" must be to check the maximum output (in characters?) against?
No idea what "outline" is doing ... is that the string containing the values from the csv file?
Does this save each fragment then to a different file?
Yes, limit is just a sensible number rather than trying to use all 8,000 characters - it saves complex calculations.
outline is just that - a concatenation of records/fields, built up until it reaches the limit, at which point it is printed.
No, it prints to stdout - that is usually redirected to a file. You could postprocess that into different fragments, or the print statements (in the awk above) can be directed to file(s) using an incrementing variable.
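That second option - directing each print to a file named with an incrementing variable - can be sketched like this (the fragment names are made up):

```shell
# awk treats the parenthesised expression as a filename, so each record
# lands in its own numbered file: fragment1.xml, fragment2.xml, ...
printf 'a\nb\nc\n' | awk '{ n++; print $0 > ("fragment" n ".xml") }'
```

In the fragment-splitting script above, the same redirection would replace the plain print, bumping n each time a fragment is flushed.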
Quote:
Originally Posted by syg00
You could postprocess that to different fragments, or the print statements (in the awk above) can be directed to file(s) using an incrementing variable.
I believe you! I don't know how to do that ...
You say the print statement is the place to do it. How about something like ...