LinuxQuestions.org
Old 04-07-2005, 07:57 PM   #1
Optimistic
Member
 
Registered: Jun 2004
Location: Germany
Distribution: Debian (testing)
Posts: 276

Rep: Reputation: 33
Using sed to add carriage returns and line numbers.


I think that sed will be the right tool for this job. Here is what I'm trying to do: take a block of text (like an essay) and turn it into a database where each sentence becomes one field and each sentence is assigned a number based on its order. So, I thought that I could use sed to add the returns: just have it look for a '.' (or some other punctuation mark) and insert a '\r' after it. But how can I get sed to add the line numbers? Would awk be better for that?

An example of what I want to do:

Turn this:
Code:
This is a sentence of an essay.  There will be many sentences.  
Some sentences are long and some sentences are short, but all end 
with punctuation.
Into this:
Code:
1<tab>This is a sentence of an essay.
2<tab>There will be many sentences.
3<tab>Some sentences are long and some sentences are short, but all end with punctuation.
The "<tab>s" will be actual tabs, not written out.
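On the question of whether awk would be better for the numbering: one way to divide the work is to let tr and sed do the splitting and let awk do the counting. The following is only a sketch under a few assumptions: it relies on GNU sed (for `\n` in the replacement), the function name `number_sentences` is made up for illustration, and every '.', '?' or '!' is treated as a sentence end, so abbreviations and decimal numbers would be split as well.

```shell
# number_sentences: hypothetical helper name, not from the thread.
# Reads text on stdin and writes "N<tab>sentence" lines to stdout.
number_sentences() {
  # 1. tr joins wrapped lines and squeezes runs of spaces into one space.
  # 2. sed (GNU) starts a new line after each '.', '?' or '!'.
  # 3. awk numbers every non-empty line, separated by a real tab.
  tr -s '\n' ' ' |
    sed 's/\([.?!]\) */\1\n/g' |
    awk 'NF { printf "%d\t%s\n", ++n, $0 }'
}

# Example run on two short sentences:
printf 'First sentence.  Second\nsentence here.\n' | number_sentences
```

Because the function only reads stdin and writes stdout, the result can be redirected straight into a file for import into a database.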
 
Old 04-07-2005, 08:18 PM   #2
puffinman
Member
 
Registered: Jan 2005
Location: Atlanta, GA
Distribution: Gentoo, Slackware
Posts: 217

Rep: Reputation: 31
I would use perl...

Code:
#!/usr/bin/perl

undef $/; # slurp mode on

# foreach argument
foreach $file (@ARGV) {

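  # slurp up this file
  open(FH, "<$file") or die "Couldn't open $file: $!";
  $everything = <FH>;
  close(FH);

  # join wrapped lines by collapsing each run of whitespace to one space
  $everything =~ s/\s+/ /g;

  # split into sentences
  @lines = split /(?<=[.?!])\s*/, $everything;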

  # save sentences from this file
  push @all_lines, @lines;
}

# decide how many leading zeros to have in the number
$line_number = ("0"x (log(@all_lines)/log(10)))."1";

# an alternative to the above is to use a fixed number like so
# $line_number = "000001";

# print the result to stdout
foreach $line (@all_lines) {
  print $line_number++,"\t$line\n";
}
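The leading-zero step above sizes the counter from the total number of sentences. A minimal shell sketch of the same idea follows, counting the digits of the largest number instead of taking a logarithm, which avoids any floating-point rounding. The name `pad_numbers` is hypothetical, and the sketch assumes the input already has one sentence per line.

```shell
# pad_numbers: hypothetical helper name, not from the thread.
# Reads lines on stdin and prefixes each with a zero-padded counter that is
# just wide enough for the total number of lines.
pad_numbers() {
  awk '
    { line[NR] = $0 }                 # buffer every input line
    END {
      width = length(NR "")           # digits in the largest line number
      fmt = "%0" width "d\t%s\n"      # for example "%02d\t%s\n" for 10 to 99 lines
      for (i = 1; i <= NR; i++)
        printf fmt, i, line[i]
    }'
}

# Example run: ten input lines, so the counter is two digits wide.
printf '%s\n' a b c d e f g h i j | pad_numbers
```

Buffering the whole input is what makes the width known before the first line is printed; a fixed width, as in the commented-out alternative, avoids the buffering.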
 
Old 04-07-2005, 08:33 PM   #3
Optimistic
Member
 
Registered: Jun 2004
Location: Germany
Distribution: Debian (testing)
Posts: 276

Original Poster
Rep: Reputation: 33
Excellent, thanks puffinman!
 
Old 04-07-2005, 08:42 PM   #4
mjrich
Senior Member
 
Registered: Dec 2001
Location: New Zealand
Distribution: Debian
Posts: 1,046

Rep: Reputation: 45
Code:
sed 's/\.[[:space:]]/\.\n/g' <filename> | sed = - | sed 'N;s/\n/\t/'
Blimey - Sed hasn't turned out to be more succinct than the venerable Perl, has it...

Cheers,

mj
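For readers unfamiliar with the last two stages of that pipeline, here is a small sketch of what they do on their own. It assumes GNU sed (for `\t` in the replacement), and `number_lines` is a made-up name for illustration. Note that these stages number lines, not sentences, so a sentence that wraps across input lines still ends up on more than one numbered line.

```shell
# number_lines: hypothetical helper name, not from the thread.
# "sed =" prints each line number on its own line before the line itself.
# "N" then appends the next line to the pattern space, so the substitution
# can turn the newline between number and text into a tab.
number_lines() {
  sed = | sed 'N;s/\n/\t/'
}

# Stage 1 alone: the numbers and the text alternate, one per line.
printf 'first\nsecond\n' | sed =

# Both stages: prints 1<tab>first and 2<tab>second.
printf 'first\nsecond\n' | number_lines
```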
 
Old 04-07-2005, 08:55 PM   #5
puffinman
Member
 
Registered: Jan 2005
Location: Atlanta, GA
Distribution: Gentoo, Slackware
Posts: 217

Rep: Reputation: 31
In this case, no, because your code doesn't put one sentence per line. And I was writing for readability, not for brevity. But if I must:

Code:
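#!/usr/bin/perl
undef $/;    # slurp mode: read each input file in one go
while (<>) { push @s, split /(?<=[.?!])\s*/ }
# pass the sentence as an argument so a literal % in the text is not
# treated as a format directive
printf "%05d\t%s\n", ++$l, $_ foreach @s;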

Last edited by puffinman; 04-07-2005 at 09:04 PM.
 
Old 04-07-2005, 09:06 PM   #6
homey
Senior Member
 
Registered: Oct 2003
Posts: 3,057

Rep: Reputation: 61
Maybe this...
Code:
#!/bin/bash
a=/home/file.txt
cat ${a} | tr '\n' ' ' | sed -e 's/\.[[:blank:]]*/\.\n/g' | \
awk -F"." '{OFS="\t"}{print NR".",$1"."}' > file1.txt
 
Old 04-07-2005, 09:30 PM   #7
puffinman
Member
 
Registered: Jan 2005
Location: Atlanta, GA
Distribution: Gentoo, Slackware
Posts: 217

Rep: Reputation: 31
Yes, this is one-upmanship, but I've squeezed it into 90 characters (counting the shebang), and it works for any number of input files and for any ending punctuation (not just periods). Before you ask, the answer is: no life at all.

Code:
#!/usr/bin/perl
undef $/;while(<>){printf"%05.d\t$_\n",++$l foreach split/(?<=[.?!])\s*/}
 
Old 04-07-2005, 09:47 PM   #8
homey
Senior Member
 
Registered: Oct 2003
Posts: 3,057

Rep: Reputation: 61
Yes, but how about this?
Code:
#!/usr/bin/perl
undef $/;while(<> ){printf"%1.d\t$_\n",++$l foreach split/(?<=[.?!])\s*/}
 
Old 04-07-2005, 09:51 PM   #9
Optimistic
Member
 
Registered: Jun 2004
Location: Germany
Distribution: Debian (testing)
Posts: 276

Original Poster
Rep: Reputation: 33
Ahh, homey, I had completely forgotten about the bash and sed combo. Here is a little something I cooked up based on your script, which is a bit more interactive:

Code:
#! /bin/bash
echo -n "File to split?"
read -e FILE
a=$FILE
cat ${a} | tr '\n' ' ' | sed -e 's/\.[[:blank:]]*/\.\n/g' | \
awk -F"." '{OFS="\t"}{print NR".",$1"."}' > OUT$FILE
Edit: puffinman, you rule!

Last edited by Optimistic; 04-07-2005 at 09:53 PM.
 
Old 04-07-2005, 10:15 PM   #10
homey
Senior Member
 
Registered: Oct 2003
Posts: 3,057

Rep: Reputation: 61
Thumbs up

Cool!
Actually, I wasn't competing against anyone. I just felt like posting it anyway, even if everyone but me is a speed typist.
 
Old 04-07-2005, 11:46 PM   #11
puffinman
Member
 
Registered: Jan 2005
Location: Atlanta, GA
Distribution: Gentoo, Slackware
Posts: 217

Rep: Reputation: 31
Thanks, Optimistic. I tend to write utilities more like the standard Unix ones -- non-interactive so they can be used with pipes or redirection, or in other scripts. I hope some of this has been useful to you, or maybe has gotten more people interested in perl. In fact, sed and awk were the inspiration for a lot of perl features, which is why I read this thread in the first place.
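As a rough illustration of that non-interactive style in shell: the sketch below takes file names as arguments and falls back to stdin when there are none, so it works both in a pipeline and with redirection. The name `sentences` is hypothetical, and GNU sed is assumed for `\n` in the replacement.

```shell
# sentences: hypothetical filter name, not from the thread.
# With file arguments it reads those files; with none it reads stdin,
# because cat falls back to stdin when it is given no operands.
sentences() {
  cat -- "$@" |
    tr -s '\n' ' ' |                          # join wrapped lines
    sed 's/\([.?!]\) */\1\n/g' |              # one sentence per line
    awk 'NF { printf "%d\t%s\n", ++n, $0 }'   # number non-empty lines
}

# Used as a filter in a pipeline:
printf 'One. Two!\nThree?\n' | sentences
```

Because nothing is prompted for, the same function can sit in the middle of a longer pipeline or be called from another script with a list of files.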
 
Old 04-08-2005, 01:13 AM   #12
Optimistic
Member
 
Registered: Jun 2004
Location: Germany
Distribution: Debian (testing)
Posts: 276

Original Poster
Rep: Reputation: 33
puffinman: I think that I will look into perl. I probably should have long ago, but after reading a few essays by Larry Wall and some reviews about how 'crazy' perl was, I kinda decided to explore other places. Looks like I was wrong, for your script was quite elegant. I'm not a programmer (I'm one of those weird philosopher-logicians), but I do like an elegant script and a language that allows for creativity.

Thanks again, everybody!
 
  



All times are GMT -5.
