bash

babag · 02-25-2022, 02:28 PM

I have text files that I need to prep before importing them into LibreOffice Calc. The lines I want to edit look similar to this:

Code:

       4096 2012-07-28 19:32:20     /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/
     204096 2010-09-23 14:53:47     /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/

The space after the time is a tab that I've inserted as a delimiter. I need to remove all of the spaces up to but not past the tab in each line. (I want it to stop at the tab because there can be entries for the directories that follow it that might have spaces in their names that need to be preserved.)

I've looked at sed and awk but haven't been able to figure out how to do this.

This is what I'd like to end up with:

Code:

40962012-07-2819:32:20     /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/
2040962010-09-2314:53:47     /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/

edit:
This removes all of the spaces but goes beyond the tab, which would affect file/directory names:

Code:

tr -d " " < infile.txt > outfile.txt

thanks for any help,
babag

boughtonp · 02-25-2022, 04:22 PM

There don't appear to be any tabs in your post at all.

Assuming the following as what your input actually looks like...

Code:

       4096	2012-07-28	19:32:20	/media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/
     204096	2010-09-23	14:53:47	/media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/

(In the following examples, "cat -A" is used to visualize tabs as ^I and end of lines as $)

Based on what you posted, it might be enough to simply remove spaces at the start of a string (which can be indicated with "^"), so:

Code:

$ cat -A input.txt
       4096^I2012-07-28^I19:32:20^I/media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/$
     204096^I2010-09-23^I14:53:47^I/media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/

$ sed 's/^  *//' input.txt | cat -A
4096^I2012-07-28^I19:32:20^I/media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/$
204096^I2010-09-23^I14:53:47^I/media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/

But potentially more useful is to remove all spaces that are adjacent to tabs (or to the start/end of the string), e.g.:

Code:

$ cat -A input2.txt
       4096^I2012-07-28^I19:32:20^I   /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/   $
     204096^I2010-09-23^I14:53:47  ^I  /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/

$ sed -r 's/ *(^|$|\t) */\1/g' input2.txt | cat -A
4096^I2012-07-28^I19:32:20^I/media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/$
204096^I2010-09-23^I14:53:47^I/media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/
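
Since forum software tends to turn tabs into spaces, a test file with real tabs can be built with printf. This is only a sketch, assuming GNU sed (for -r and \t) and GNU cat (for -A); the temporary file and the sample paths, one of which has a space in it, are made up for illustration:

```shell
# Build a two-line sample with real tabs (file and paths are hypothetical).
sample=$(mktemp)
printf '   4096\t2012-07-28\t19:32:20\t  /tmp/My Dir/  \n' > "$sample"
printf ' 204096\t2010-09-23\t14:53:47  \t/tmp/other/\n' >> "$sample"

# Strip spaces next to tabs and at line start/end; the space inside
# "My Dir" is not adjacent to either, so it survives.
trimmed=$(sed -r 's/ *(^|$|\t) */\1/g' "$sample")
printf '%s\n' "$trimmed" | cat -A
rm -f "$sample"
```

Note that this trims around every tab; it does not touch spaces that sit between two non-space characters.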

babag · 02-25-2022, 05:42 PM

Thanks for the response, boughtonp.

Actually, my input looks exactly like what I posted. There are leading spaces, then there is a space between each of the size/date/time listings, followed by a tab, then the directories.

I want to delete the leading spaces and the spaces between size/date/time, then stop there, retaining the tab after the time and any spaces in directory/filenames.

I'll see what the things you posted do to a test file.

edit:
Just ran both commands on a test file and each:

Code:

sed 's/^  *//' input.txt | cat -A
sed -r 's/ *(^|$|\t) */\1/g' input2.txt | cat -A

seems to do the opposite of what I was looking for. They both retain the spaces preceding the tab and delete the tab.

thanks again,
babag

astrogeek · 02-25-2022, 05:55 PM

Maybe something like this...

Code:

cat infile
       4096 2012-07-28 19:32:20 /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/
     204096 2010-09-23 14:53:47 /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/


awk 'BEGIN{FS="[\t]";}{gsub(" ","",$1); print $1"\t"$2}' infile
40962012-07-2819:32:20  /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/
2040962010-09-2314:53:47        /media/babag/Projects_01_A/Audio_Group-00/e135208cdf2d721346eb/update/

Of course, this presumes that "your" <tab> is the one and only <tab> in each line.
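
If it is unclear whether that holds for a whole file, a quick check is possible before converting. A minimal sketch; the function name check_single_tab is made up here, and it simply counts tabs per line:

```shell
# Report every line that does not contain exactly one tab.
# No output means the one-tab-per-line assumption holds.
check_single_tab() {
  awk '{ n = gsub(/\t/, "&") }            # gsub returns the number of tabs found
       n != 1 { printf "line %d: %d tab(s)\n", NR, n }' "$1"
}

# Hypothetical usage:
# check_single_tab infile
```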

babag · 02-25-2022, 06:09 PM

Thanks astrogeek! That did it. I like that it's awk too. Also seems to preserve spaces in directories/filenames. And, yes, there's only a single tab per line.

thanks again,
babag

astrogeek · 02-25-2022, 06:11 PM

You are welcome, glad it helped!

boughtonp · 02-26-2022, 07:36 AM

So you do actually want to merge the id, date and time into a single field?!?

That seems like an odd thing to do - makes me think there's perhaps a different underlying issue - but anyway I would do a slight variation of astrogeek's solution:

Code:

awk 'BEGIN{FS="\t";OFS="\t"}{gsub(" ","",$1); print}' infile

The main benefit being that if there is a third/fourth/etc field it still works, without the need to explicitly add them.

MadeInGermany · 02-27-2022, 06:28 AM

(Only) the last awk solution (with FS=OFS="\t") is okay, because changing a field causes a rebuild of the input line using OFS as a field separator.
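
The rebuild is easy to see on a line with more than one tab. A small sketch on a made-up sample line: once $1 is modified, awk rejoins the fields with OFS, which defaults to a single space, so without OFS="\t" the tabs are replaced.

```shell
# Sample record with two tabs (hypothetical data).
line=$(printf '  4096 2012-07-28\t/tmp/My Dir/\textra')

# Default OFS (a space): modifying $1 rebuilds $0 and the tabs become spaces.
default_ofs=$(printf '%s\n' "$line" | awk 'BEGIN{FS="\t"}{gsub(" ","",$1); print}')

# OFS set to a tab: the rebuilt record keeps its tabs.
tab_ofs=$(printf '%s\n' "$line" | awk 'BEGIN{FS=OFS="\t"}{gsub(" ","",$1); print}')

printf '%s\n' "$default_ofs" | cat -A   # 40962012-07-28 /tmp/My Dir/ extra$
printf '%s\n' "$tab_ofs" | cat -A       # 40962012-07-28^I/tmp/My Dir/^Iextra$
```

A version that prints $1"\t"$2 explicitly does not depend on OFS, so it is fine as long as each line really has exactly one tab.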

sed solutions:
If you know there are two fields:

Code:

sed 's/^ *\([^ ]*\) \([^ ]*\)/\1\2/'

If you want to remove the spaces before a tab without knowing the format, then you must use a loop:

Code:

sed -e ':L' -e 's/^\([^\t]*\) /\1/; tL'
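
To try the loop on a sample line, here is a sketch that assumes GNU sed, which understands \t inside the bracket expression (other seds may need a literal tab there). The sample path, with a space in it, is made up:

```shell
line=$(printf '   4096 2012-07-28 19:32:20\t/tmp/My Dir/')

# Each pass deletes the last space in the tab-free prefix of the line;
# 'tL' jumps back to ':L' for as long as a substitution was made.
squeezed=$(printf '%s\n' "$line" | sed -e ':L' -e 's/^\([^\t]*\) /\1/; tL')

printf '%s\n' "$squeezed" | cat -A
```

Because [^\t]* cannot cross the tab, the space in "My Dir" is never reached.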

syg00 · 02-27-2022, 06:37 AM

Rubbish - read post #5. Why do you always have to pontificate?

GazL · 02-27-2022, 08:07 AM

Here's a solution just using bash commands:

Code:

while read num date time dir
do
  printf "%d%s%s\t%s\n" "$num" "$date" "$time" "$dir"
done < /tmp/input.txt

It will cope with spaces in the dirname so long as they're not leading spaces (which read will strip).

bash is not the fastest, however, so if you have millions of rows you might want to use one of the other solutions. But you did ask how to do it in bash.
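
If leading spaces in the directory name, backslashes in paths, or a missing final newline matter, a bash-only variant can split each line at the first tab instead of on whitespace. This is a sketch rather than a drop-in replacement; the function name squeeze_before_tab is made up, and it relies on bash features ([[ ]], $'\t', ${var// /}):

```shell
# Split each line at the first tab only, then delete spaces from the left part.
squeeze_before_tab() {
  local line left right
  # 'IFS= read -r' keeps leading spaces and backslashes intact;
  # '|| [[ -n $line ]]' also handles a last line without a newline.
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == *$'\t'* ]]; then
      left=${line%%$'\t'*}     # everything before the first tab
      right=${line#*$'\t'}     # everything after the first tab
      printf '%s\t%s\n' "${left// /}" "$right"
    else
      printf '%s\n' "$line"    # no tab: leave the line alone
    fi
  done
}

# Hypothetical usage:
# squeeze_before_tab < /tmp/input.txt
```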

pan64 · 02-27-2022, 08:22 AM

Yes, that's what I wanted to say too. Split the line on whitespace and reconstruct it, either in bash or in awk/perl/python/sed/whatever, like this:

Code:

awk '{ printf "%d%s%s\t%s\n", $1, $2, $3, $4 }'
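
One caveat, sketched below on a made-up path: with awk's default whitespace splitting, $4 is only the first word of a path that contains spaces, so the rest of the name is dropped. Cutting the line at the tab by hand keeps the path whole.

```shell
line=$(printf '   4096 2012-07-28 19:32:20\t/tmp/My Dir/')

# Default field splitting: the path is cut at its first space.
by_fields=$(printf '%s\n' "$line" | awk '{ printf "%d%s%s\t%s\n", $1, $2, $3, $4 }')

# Cut at the first tab instead and strip spaces from the left part only.
by_tab=$(printf '%s\n' "$line" | awk '{
  t = index($0, "\t")
  if (t == 0) { print; next }      # no tab: leave the line alone
  left = substr($0, 1, t - 1)
  gsub(/ /, "", left)
  print left substr($0, t)         # substr($0, t) starts with the tab
}')

printf '%s\n%s\n' "$by_fields" "$by_tab" | cat -A
```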