Text Manipulation

nooobeee · 01-23-2018, 06:43 AM

I have a file with approximately 16 million rows and need to insert date/time stamps for each row in 1 hour increments for an entire year and then starting over when a new value is in the first column.

ie.

Code:

A,40,30
A,53,20
A,33,10
...
A,55,49
B,11,34
B,43,79
B,59,76
...
B,18,77
C,64,29

to

Code:

A,01/01/2017 01:00,40,30
A,01/01/2017 02:00,53,20
A,01/01/2017 03:00,33,10
...
A,12/31/2017 23:00,55,49
B,01/01/2017 01:00,11,34
B,01/01/2017 02:00,43,79
B,01/01/2017 03:00,59,76
...
B,12/31/2017 23:00,18,77
C,01/01/2017 01:00,64,29

I thought about converting it to epoch and then back but had difficulty with that. But I noticed perl had some date/time functions that might work well. Any thoughts on how this might be accomplished?

keefaz · 01-23-2018, 07:16 AM

If you use Linux, it's possible to use gawk time functions as well

eg:

Code:

awk '
BEGIN {
  timestamp=mktime("2017 01 01 00 00 00");
  hour=1
}
{
  print( strftime("%m/%d/%Y %R", timestamp + hour++ * 60 * 60) );
}
' file.txt

From this, you test if $1 changed, if yes reset hour to 1
Format output as you wish

Edit: you'll need another test if you want to go from 23:00 to 1:00

sundialsvcs · 01-23-2018, 08:04 AM

First of all, I'd stuff that data into a SQLite database file and do further manipulations with it that way.

Then, treating this file as the "database" that it in fact is, you'll need to write a short computer-program using some language other than "Bash scripting."

nooobeee · 01-23-2018, 10:34 AM

After playing around a bit, I was able to use the following:

Code:

awk 'BEGIN 
  {FS=OFS=",";timestamp=mktime("2017 01 01 00 00 00");hour=1}
 {print $1,( strftime("%D %R", timestamp + hour++ * 60 * 60)),$2,$3;
}' input.csv > output.csv

This works but continues to the next years until the end of the file. I would like it to restart the date/time when a new value is in the first column. The first column is already sorted so I'm thinking uniq could be used for this. Any thoughts on doing that?

danielbmartin · 01-23-2018, 10:53 AM

Quote:

Originally Posted by nooobeee

I would like it to restart the date/time when a new value is in the first column.... Any thoughts on doing that?

This is a code snippet to give you an idea.

With this InFile ...

Code:

2011 apple
2011 pear
2011 peach
2012 blue
2012 red
2012 yellow
2013 salty
2013 sweet
2013 bitter
2013 sour
2014 honda
2014 toyota
2014 nissan

... this awk ...

Code:

awk '{if ($1!=save) counter++; save=$1; 
      print counter,$0;}' $InFile >$OutFile

... produced this OutFile ...

Code:

1 2011 apple
1 2011 pear
1 2011 peach
2 2012 blue
2 2012 red
2 2012 yellow
3 2013 salty
3 2013 sweet
3 2013 bitter
3 2013 sour
4 2014 honda
4 2014 toyota
4 2014 nissan

Daniel B. Martin

.

nooobeee · 01-23-2018, 11:19 AM

That's good info Daniel and I'm going to take a look at that to see if I can figure out how to adjust. However my desired outcome is the following in your example:

Code:

1 2011 apple
2 2011 pear
3 2011 peach
1 2012 blue
2 2012 red
3 2012 yellow
1 2013 salty
2 2013 sweet
3 2013 bitter
4 2013 sour
1 2014 honda
2 2014 toyota
3 2014 nissan

keefaz · 01-23-2018, 11:23 AM

Quote:

Originally Posted by nooobeee

After playing around a bit, I was able to use the following:
This works but continues to the next years until the end of the file. I would like it to restart the date/time when a new value is in the first column. The first column is already sorted so I'm thinking uniq could be used for this. Any thoughts on doing that?

Yes, test if $1 has changed, for this you have to compare $1 with a value, as this value will change you need a variable to store this value.
So you have to test $1 each time with a variable value, if changed reset hour to 1

nooobeee · 01-23-2018, 11:42 AM

Quote:

Originally Posted by keefaz

Yes, test if $1 has changed, for this you have to compare $1 with a value, as this value will change you need a variable to store this value.
So you have to test $1 each time with a variable value, if changed reset hour to 1

Ok, it seems you may be a few steps ahead of me (thanks for your patience). If i'm understanding you correctly, I need to store the first $1 as a user defined variable and add the command to check $1 against the stored variable. If it matches, continue but if it doesn't, then reset the counter, overwrite the variable, and continue. Is that what you're saying?

keefaz · 01-23-2018, 11:44 AM

It sounds correct, post code if you're stuck

danielbmartin · 01-23-2018, 12:15 PM

Quote:

Originally Posted by nooobeee

I would like it to restart the date/time when a new value is in the first column.... Any thoughts on doing that?

This is a code snippet to give you an idea.

With this InFile ...

Code:

2011 apple
2011 pear
2011 peach
2012 blue
2012 red
2012 yellow
2013 salty
2013 sweet
2013 bitter
2013 sour
2014 honda
2014 toyota
2014 nissan

... this awk ...

Code:

awk '{if ($1!=save) j=0; save=$1; j++
      print j,$0}' $InFile >$OutFile

... produced this OutFile ...

Code:

1 2011 apple
2 2011 pear
3 2011 peach
1 2012 blue
2 2012 red
3 2012 yellow
1 2013 salty
2 2013 sweet
3 2013 bitter
4 2013 sour
1 2014 honda
2 2014 toyota
3 2014 nissan

Daniel B. Martin

.

nooobeee · 01-23-2018, 06:54 PM

I appreciate you forcing me to understand it and figure it out Daniel...though frustrating, i'm sure i'll be better off for it. Anyway I think i understand what your code is doing by checking if $1 does not equal save as well as saving the $1 value and finally incrementing the value for j which is initially set at 0. It seems that the mktime and strftime are throwing me off now (perhaps due to lack of understanding). I mangled together the following code with the included errors:

Code:

awk 'BEGIN {FS=OFS=",";hour=1;if ($1!=u) timestamp=mktime("2017 01 01 00 00 00");u=$1;(strftime("%D %R", timestamp + hour++ * 60 * 60))}{print $1,(strftime("%D %R", timestamp)),$2;}' input.csv > output.csv
awk: line 2: function strftime never defined
awk: line 2: function strftime never defined
awk: line 2: function mktime never defined

I seem to recall getting the error about defining one or both of those before but it seems they're defined here. what am i missing?

danielbmartin · 01-23-2018, 07:05 PM

Post withdrawn. Advice was wrong.

nooobeee · 01-23-2018, 07:15 PM

I don't logically see the need for them either, removed them (from both strftime instances) but no change.

nooobeee · 01-23-2018, 07:16 PM

oh wait, i think i need to install gawk or something like that...i'm pretty sure i had this on my system at work and that's how i addressed it

keefaz · 01-23-2018, 07:18 PM

Quote:

Originally Posted by nooobeee

I appreciate you forcing me to understand it and figure it out Daniel...though frustrating, i'm sure i'll be better off for it. Anyway I think i understand what your code is doing by checking if $1 does not equal save as well as saving the $1 value and finally incrementing the value for j which is initially set at 0. It seems that the mktime and strftime are throwing me off now (perhaps due to lack of understanding). I mangled together the following code with the included errors:

Code:

awk 'BEGIN {FS=OFS=",";hour=1;if ($1!=u) timestamp=mktime("2017 01 01 00 00 00");u=$1;(strftime("%D %R", timestamp + hour++ * 60 * 60))}{print $1,(strftime("%D %R", timestamp)),$2;}' input.csv > output.csv
awk: line 2: function strftime never defined
awk: line 2: function strftime never defined
awk: line 2: function mktime never defined

I seem to recall getting the error about defining one or both of those before but it seems they're defined here. what am i missing?

Close, but you moved things in wrong places

awk syntax is like: condition {action}
so:

Code:

awk '
BEGIN {FS=OFS=",";timestamp=mktime("2017 01 01 00 00 00")}
$1!=u {hour=1;u=$1}
{print $1, strftime("%D %R", timestamp + hour++ * 60 * 60),$2, $3}
' input.csv

BEGIN{}: initialize block
$1 != u: condition
if condtion is true then next block: {hour=1;u=$1}
and following block is executed without condition