LinuxQuestions.org - Manipulating text

phillipseamore

10-02-2008 11:58 PM

Manipulating text

Hi all,

I'm in desperate need of reformatting an old text file with some 5300 lines of text. I think that awk or sed is the way to go but am getting confused.

The original format is like this:

Code:

in:  160.11
See our bus? It's really big.
out: 165.02
in:  165.12
-Is it the great big one here?
-That's right.
out: 171.03
...

But the new format needs to be like this:

Code:

@
160.11                165.02
See our bus? It's really big.
@
165.12                171.03
-Is it the great big one here?
-That's right.

Does anyone have a clue about how to proceed?

Thanks,
Phil

kenneho

10-03-2008 01:29 AM

Hi.

I personally don't think sed is the way to go here. All you seem to be doing is moving text, not manipulating it much.

I haven't used awk much, so I don't have an opinion on it. But a small shell or Perl script would do the job quite easily. All you have to do is read a few lines at a time, store the lines or tokens in variables, and then write the stored variables to a file.
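
The read, store, write approach described above could look roughly like the sketch below. It is written in Python rather than shell or Perl, purely for illustration. The function name `reformat`, the two-tab separator between the times, and the in-memory `sample` list are assumptions, not part of the original data or of anyone's actual script.

```python
def reformat(lines):
    """Convert in:/out: records to the '@' layout, reading one line at a time."""
    result = []
    start = None  # start time of the record being read, or None between records
    text = []     # text lines stored until the matching "out:" line arrives
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("in:"):
            start = line[3:].strip()
            text = []
        elif line.startswith("out:") and start is not None:
            end = line[4:].strip()
            result.append("@")
            result.append(start + "\t\t" + end)  # two tabs: an assumed separator
            result.extend(text)
            start = None
        elif start is not None:
            text.append(line)
    return result


# Sample lines copied from the question; a real run would read the file instead.
sample = [
    "in:  160.11\n",
    "See our bus? It's really big.\n",
    "out: 165.02\n",
    "in:  165.12\n",
    "-Is it the great big one here?\n",
    "-That's right.\n",
    "out: 171.03\n",
]
print("\n".join(reformat(sample)))
```

Lines that fall outside an in:/out: pair are ignored in this sketch, which may or may not be what the real file needs.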

chrism01

10-03-2008 03:08 AM

Yeah, for re-formatting like that I'd use Perl. These links will help:
http://perldoc.perl.org/
http://www.perlmonks.org/?node=Tutorials

phillipseamore

10-03-2008 03:19 AM

Thanks so much kenneho and chrism01. I feel like such an idiot now! :)

I managed to nail together a quick parser for this in PHP, actually:

Code:

<?php
// Convert the in:/out: records in SUB.txt to the "@" format.
$handle = fopen("SUB.txt", "r");
$text = "";

while (($buffer = fgets($handle, 4096)) !== false) {
    if (substr($buffer, 0, 3) == "in:") {
        // Start of a record: print the marker and the start time.
        echo "@\n";
        echo rtrim(substr($buffer, 5, 6));
    } elseif (substr($buffer, 0, 4) == "out:") {
        // End of a record: print the end time, then the stored text.
        echo "\t\t";
        echo rtrim(substr($buffer, 5, 6));
        echo "\n";
        echo $text;
        $text = "";
    } else {
        // Any other line is text; store it until "out:" is reached.
        $text .= $buffer;
    }
}
fclose($handle);
?>

MarkBurke

10-03-2008 09:45 PM

awk attempt

1.awk
==
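# Text is collected only between an "in:" line and its "out:" line.
BEGIN { INSIDE = 0 }
/^in:/ { INVALUE = $2 ; INSIDE = 1 }

!/^in:/ && !/^out:/ && INSIDE {
if (length(lines) > 0)
lines = sprintf("%s\n%s", lines, $0) ;
else lines = $0 ;
}

/^out:/ {
OUTVALUE = $2 ;
printf("@\n%s\t\t%s\n", INVALUE, OUTVALUE) ;
print lines ; lines = "" ;
INSIDE = 0
}

# Flush the text of a final record that has no "out:" line.
END { if (length(lines) > 0) print lines }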
==

gawk -f 1.awk < 1.in

==
in:

in: 160.11
See our bus? It's really big.
out: 165.02
in: 165.12
-Is it the great big one here?
-That's right.
out: 171.03
...

out:

@
160.11 165.02
See our bus? It's really big.
@
165.12 171.03
-Is it the great big one here?
-That's right.

Mr. C.

10-03-2008 10:37 PM

Here's a simpler version I think:

Code:

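$ cat doit.pl
#!/usr/bin/perl

# Slurp the whole input at once instead of reading it line by line.
undef $/;
$_ = <>;

# Rewrite each in:/text/out: record as "@", the two times, then the text.
s/^in:[ \t]+(\d+\.\d+)\n(.*?)\n^out:[ \t]+(\d+\.\d+)\n/@\n$1      $3\n$2\n/gms;

print "$_";

$ ./doit.pl data
@
160.11      165.02
See our bus? It's really big.
@
165.12      171.03
-Is it the great big one here?
-That's right.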
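
For comparison, the same whole-file substitution idea can be sketched in Python. This is only an illustration of the technique, not a translation anyone in the thread posted; the names `RECORD` and `convert` are made up here, and the six-space separator simply mirrors the Perl replacement string.

```python
import re

# Same pattern as the Perl one-liner: MULTILINE lets ^ match at each line start,
# DOTALL lets .*? span several text lines within one record.
RECORD = re.compile(
    r"^in:[ \t]+(\d+\.\d+)\n(.*?)\n^out:[ \t]+(\d+\.\d+)\n",
    re.MULTILINE | re.DOTALL,
)


def convert(data):
    """Rewrite every in:/text/out: record in a string holding the whole file."""
    return RECORD.sub(r"@\n\1      \3\n\2\n", data)


data = (
    "in:  160.11\n"
    "See our bus? It's really big.\n"
    "out: 165.02\n"
    "in:  165.12\n"
    "-Is it the great big one here?\n"
    "-That's right.\n"
    "out: 171.03\n"
)
print(convert(data), end="")
```

As with the Perl version, the whole file is held in memory, which should be no problem for roughly 5300 lines.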

archtoad6

10-05-2008 12:51 PM

If you don't mind preprocessing w/ sed :), this is pretty short & sweet:

Code:

#! /bin/bash
cat $1                                \
| sed -r 's,^out: *([^ ]*),@\1\n,
  /^in/N;s,^in: *([^ ]+).*\n,\1\n@,'  \
| awk 'BEGIN{RS=""; FS="\n@"};
  {print "@","\n"$1,$3,"\n"$2}'
exit  # end of script


## test output:
@
160.11 165.02
See our bus? It's really big.
@
165.12 171.03
-Is it the great big one here?
-That's right.

I might add this is the 1st time I have seen a perl script of <100 lines that is shorter than my bash equivalent.

EDIT:
I spoke too soon -- here is an all sed script that is the shortest of all:

Code:

#!/bin/sed -rf
:n;N;s,out: *,@,;T n
s,in: *([^\n]*)(.*)\n@(.*),@\n\1 \3\2,

I don't know if this is good or bad -- we can only guess at the true format of the actual data -- but I don't assume that the "in:" & "out:" data will be digits. I could if needed.
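
To make that trade-off concrete, here is a small Python sketch (not sed) comparing a digits-only pattern with a permissive one. The names `STRICT`, `LOOSE`, and `start_time` are invented for the example, and the timecode `00:02:40.11` is a hypothetical value, not something known to occur in the real data.

```python
import re

# Digits-only: accepts values like 160.11 and nothing else.
STRICT = re.compile(r"^in:[ \t]+(\d+\.\d+)$")
# Permissive: accepts any run of non-whitespace characters after "in:".
LOOSE = re.compile(r"^in:[ \t]+(\S+)$")


def start_time(line, pattern):
    """Return the captured time, or None if the line does not match."""
    m = pattern.match(line)
    return m.group(1) if m else None


for line in ("in:  160.11", "in:  00:02:40.11"):
    print(repr(line), start_time(line, STRICT), start_time(line, LOOSE))
```

The permissive pattern keeps working if the times turn out to be in some other notation, at the cost of also accepting garbage.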

Happy to answer any Q's about how it works.
(http://www.gnu.org/software/sed/manual/sed.html)

All times are GMT -5.