LinuxQuestions.org

phillipseamore 10-02-2008 11:58 PM

Manipulating text
 
Hi all,

I'm in desperate need of reformatting an old text file with some 5300 lines of text. I think that awk or sed is the way to go but am getting confused.

The original format is like this:

Code:

in:  160.11
See our bus? It's really big.
out: 165.02

in:  165.12
-Is it the great big one here?
-That's right.
out: 171.03
...

But the new format needs to be like this:

Code:

@
160.11                165.02
See our bus? It's really big.

@
165.12                171.03
-Is it the great big one here?
-That's right.

Anyone with a clue about how to proceed?

Thanks,
Phil

kenneho 10-03-2008 01:29 AM

Hi.


I personally don't think sed is the way to go here. All you seem to be doing is moving text, not manipulating it much.

I haven't used awk much, so I don't have an opinion here. But a small shell or perl script would do the job quite easily. All you have to do is read a few lines at a time, store the lines or tokens in variables, and then write the stored variables to a file.
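
A minimal sketch of that read/store/write idea in plain shell, for illustration only. The function name `reformat` is made up, the two tabs between the times are an assumption (the requested format only shows a run of blanks), and lines that fall outside an in:/out: pair are assumed to be safe to drop:

```shell
#!/bin/sh
# Sketch only: reformat() is a hypothetical helper name.
# Reads "in:/text/out:" records on stdin, writes "@" records on stdout.
reformat() {
    nl='
'
    in_time="" text="" inside=no
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            in:*)
                # Store the start time: drop the label, then leading blanks.
                in_time=${line#in:}
                in_time=${in_time#"${in_time%%[![:space:]]*}"}
                text=""
                inside=yes
                ;;
            out:*)
                # Store the end time, then write out the whole record.
                out_time=${line#out:}
                out_time=${out_time#"${out_time%%[![:space:]]*}"}
                # Two tabs between the times is an assumption.
                printf '@\n%s\t\t%s\n%s\n' "$in_time" "$out_time" "$text"
                inside=no
                ;;
            *)
                # Text lines are collected only between "in:" and "out:";
                # anything else (blank separators, "...") is dropped.
                if [ "$inside" = yes ]; then
                    text="${text}${line}${nl}"
                fi
                ;;
        esac
    done
}
```

With the function defined, something like `reformat < old.txt > new.txt` would do the conversion (both file names are placeholders).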

chrism01 10-03-2008 03:08 AM

Yeah, for re-formatting like that I'd use Perl. These links will help:
http://perldoc.perl.org/
http://www.perlmonks.org/?node=Tutorials

phillipseamore 10-03-2008 03:19 AM

Thanks so much kenneho and chrism01. I feel like such an idiot now! :)

I actually managed to nail together a quick parser for this in PHP:

Code:

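<?php

// Convert the "in:/text/out:" records in SUB.txt to the "@" format on stdout.
$handle = fopen("SUB.txt", "r");
if ($handle === false) {
    exit("Cannot open SUB.txt\n");
}
$text = "";

while (($buffer = fgets($handle, 4096)) !== false) {
    if (substr($buffer, 0, 3) == "in:") {
        // Start of a record: print the marker and the start time.
        echo "@\n";
        echo trim(substr($buffer, 3));
        $text = "";   // discard anything seen between records
    } elseif (substr($buffer, 0, 4) == "out:") {
        // End of a record: print the end time, then the collected text.
        echo "\t\t";
        echo trim(substr($buffer, 4));
        echo "\n";
        echo $text;
        echo "\n";    // blank line between records
        $text = "";
    } else {
        // Any other line is text; collect it until "out:" shows up.
        $text .= $buffer;
    }
}
fclose($handle);

?>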


MarkBurke 10-03-2008 09:45 PM

awk attempt
 
1.awk
==
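# Start of a record: remember the start time, reset the collected text.
/^in:/ { INVALUE = $2 ; lines = "" ; INSIDE = 1 ; next }

# End of a record: print the marker, both times, then the collected text.
/^out:/ {
    OUTVALUE = $2
    printf("@\n%s\t\t%s\n", INVALUE, OUTVALUE)
    print lines ; lines = ""
    INSIDE = 0
    next
}

# Any other line inside a record is text; join lines with newlines.
INSIDE {
    if (length(lines) > 0)
        lines = sprintf("%s\n%s", lines, $0)
    else
        lines = $0
}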
==

gawk -f 1.awk < 1.in


==
Input (1.in):

in: 160.11
See our bus? It's really big.
out: 165.02
in: 165.12
-Is it the great big one here?
-That's right.
out: 171.03
...

Output:

@
160.11 165.02
See our bus? It's really big.
@
165.12 171.03
-Is it the great big one here?
-That's right.

Mr. C. 10-03-2008 10:37 PM

Here's a simpler version I think:

Code:

$ cat doit.pl
#!/usr/bin/perl

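# Slurp the whole input into $_ (no input record separator).
undef $/;
$_ = <>;

# Rewrite each in:/text/out: record as "@", both times, then the text.
s/^in:[ \t]+(\d+\.\d+)\n(.*?)\n^out:[ \t]+(\d+\.\d+)\n/@\n$1      $3\n$2\n/gms;

print;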

$ ./doit.pl data
@
160.11      165.02
See our bus? It's really big.
@
165.12      171.03
-Is it the great big one here?
-That's right.


archtoad6 10-05-2008 12:51 PM

If you don't mind preprocessing w/ sed :), this is pretty short & sweet:
Code:

#! /bin/bash
cat $1                                \
| sed -r 's,^out: *([^ ]*),@\1\n,               
  /^in/N;s,^in: *([^ ]+).*\n,\1\n@,'  \
| awk 'BEGIN{RS=""; FS="\n@"};
  {print "@","\n"$1,$3,"\n"$2}'
exit  # end of script


## test output: 
@
160.11 165.02
See our bus? It's really big.
@
165.12 171.03
-Is it the great big one here?
-That's right.

I might add this is the 1st time I have seen a perl script of <100 lines that is shorter than my bash equivalent.

EDIT:
I spoke too soon -- here is an all sed script that is the shortest of all:
Code:

#!/bin/sed -rf
:n;N;s,out: *,@,;T n
s,in: *([^\n]*)(.*)\n@(.*),@\n\1 \3\2,

I don't know if this is good or bad -- we can only guess at the true format of the actual data -- but I don't assume that the "in:" & "out:" data will be digits. I could if needed.

Happy to answer any Q's about how it works.
(http://www.gnu.org/software/sed/manual/sed.html)


All times are GMT -5.