LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 04-30-2016, 05:18 PM   #1
hurd
LQ Newbie
 
Registered: Jan 2015
Posts: 12

Rep: Reputation: Disabled
Joining line ending with lowercase and starting with lowercase, or uppercase


Hi,

Firstly,
I would like to join a line ending with a lowercase letter to the next line starting with a lowercase letter, when there is a blank line between them, like this:

Quote:
this is

my lines
I would like to get this:

Quote:
this is my lines
Secondly,
I would like to join a line ending with a lowercase letter to the next line starting with an uppercase letter, when there is a blank line between them, like this (when there is no period):

Quote:
this is

My lines
I would like to get this:

Quote:
this is My lines

I'm not a pro with sed, but I use it sometimes (for easy stuff). I tried something like this (found on the net) for cases where there is no blank line in between:
Code:
sed -r ':a;N;$!ba;s/\n([^A-Z])/ \1/g' file.txt
I do not understand everything in the above command, but I think I can reuse it for getting the result I need.

Thirdly
How can I delete each line written in capital letters that starts or ends with a number?
Thanks for any help.
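For readers puzzling over the same one-liner, here is a sketch of it written out one command per line. It assumes GNU sed (-E is used in place of the equivalent -r, and comments are placed inside the script); the function name join_nonupper is made up for illustration.

```shell
# Assumption: GNU sed. join_nonupper is a hypothetical helper name.
join_nonupper() {
  sed -E '
    # define a label named "a"
    :a
    # append the next input line to the pattern space
    N
    # if this is not the last line, branch back to label "a"
    $!ba
    # the whole file is now in the pattern space: replace every newline
    # followed by a character other than A-Z with a space and that character
    s/\n([^A-Z])/ \1/g
  '
}

printf 'this is\nmy lines\nNext sentence\n' | join_nonupper
```

Called as join_nonupper < file.txt, this joins a line to the previous one whenever it does not start with A-Z. A blank line between the two lines is not handled by this pattern.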
 
Old 04-30-2016, 06:10 PM   #2
tshikose
Member
 
Registered: Apr 2010
Location: Kinshasa, Democratic Republic of Congo
Distribution: RHEL, Fedora, CentOS
Posts: 286

Rep: Reputation: 61
Hi,

So actually, it makes no difference whether the next line starts with lower case or upper case?
 
Old 04-30-2016, 06:11 PM   #3
tshikose
Member
 
Registered: Apr 2010
Location: Kinshasa, Democratic Republic of Congo
Distribution: RHEL, Fedora, CentOS
Posts: 286

Rep: Reputation: 61
Or you want

Quote:
This is

My lines
to become
Quote:
This is
My lines
???
 
Old 05-01-2016, 02:08 AM   #4
hurd
LQ Newbie
 
Registered: Jan 2015
Posts: 12

Original Poster
Rep: Reputation: Disabled
No, it makes no difference whether the next line starts with lower case or upper case.

And I really want to join these lines, like I said in the first post.

Thank you.
 
Old 05-01-2016, 03:42 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,839

Rep: Reputation: 1822
Not a good idea to go using commands you don't understand. Here is the online documentation. Time spent reading it will be worthwhile. Have a look at the "s" command and "Other commands" sections to work out what the command you cited in the original post does. Then you can modify it to suit what you want to do now.
 
Old 05-01-2016, 03:44 AM   #6
hurd
LQ Newbie
 
Registered: Jan 2015
Posts: 12

Original Poster
Rep: Reputation: Disabled
OK for lower case

Well, I succeeded for lower case:

I had CR characters in my original files; I converted them to LF and now it works for lower case:
Code:
sed -r ':a;N;$!ba;s/\n\n([^A-Z])/ \1/g' file.txt
But two questions remain:
1) How to join lines ending in lower case with lines starting with upper case, where there is no period between them.

2) How can I delete each line in capital letters, starting or ending with a number?
Like this:
Quote:
THIS LINE NUMBER 3
or
Quote:
3 THIS LINE
Thank you.
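A possible sketch for question 2, assuming GNU sed and assuming that "in capital letters" means the line contains no lowercase letter at all. It also assumes the CR characters are already gone, since a trailing CR would stop the "ends with a number" test from matching. The function name drop_caps_number_lines is invented for this example.

```shell
# Assumption: GNU sed; drop_caps_number_lines is a hypothetical helper name.
drop_caps_number_lines() {
  sed -E '
    # only consider lines that contain no lowercase letter
    /[[:lower:]]/!{
      # delete the line if it starts with a digit
      /^[0-9]/d
      # delete the line if it ends with a digit
      /[0-9]$/d
    }
  '
}

printf 'this is\nTHIS LINE NUMBER 3\n3 THIS LINE\nChapter 3\nMY TITLE\n' |
  drop_caps_number_lines
```

Lines such as "Chapter 3" survive because they contain lowercase letters, and capitals-only lines without a leading or trailing digit are left alone.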
 
Old 05-01-2016, 04:39 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,252

Rep: Reputation: 2685
You have a few questions you need to ask yourself as some of your data is now overlapping.
Code:
this is

THIS LINE NUMBER 3
With the above example, your first command will want to join the lines. But do you then delete the entire joined line because it has capitals and a number? Or do you delete the capitals line first, before joining, so you just end up with 'this is'? If you do, what happens to the extra blank line that was in between?

I agree with syg00 that you should first go and read what the sed line you have does and also work out if you need multiple sed's piped together or in a single script???
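To make the ordering question concrete, here is one possible answer as a two-stage pipeline: delete first, then join. It is only a sketch, assuming GNU sed, CR-free input, and that a join is wanted whenever a line ends with a lowercase letter and the next non-blank line starts with a letter. The name clean_then_join is made up.

```shell
# Assumption: GNU sed, CRs already removed; clean_then_join is a hypothetical name.
# Stage 1 deletes capitals-only lines that start or end with a digit.
# Stage 2 slurps the whole input, then replaces any run of newlines between
# a lowercase letter and a following letter with a single space, so the
# extra blank line left behind by stage 1 is absorbed as well.
clean_then_join() {
  sed -E '/[[:lower:]]/!{ /^[0-9]/d; /[0-9]$/d; }' |
    sed -E ':a;N;$!ba;s/([[:lower:]])\n+([[:alpha:]])/\1 \2/g'
}

printf 'this is\n\nTHIS LINE NUMBER 3\n\nMy lines.\n\nNext one\n' | clean_then_join
```

Joining first would instead glue 'this is' onto the capitals line, and the delete pattern would then no longer match it, so the order of the two stages matters.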
 
Old 05-01-2016, 06:17 AM   #8
AwesomeMachine
Senior Member
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora
Posts: 1,832

Rep: Reputation: 259
Try using tr.
 
Old 05-01-2016, 11:02 AM   #9
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,430

Rep: Reputation: 1350
It appears that you are trying to parse a double spaced text file prepared in Windows.
Your needs are still unclear to me, but perhaps 'sed' is the wrong tool for the task.
Given an input file
Quote:
this is.

this is

my lines

this is

My lines

THIS LINE NUMBER 3

3 THIS LINE

this is

my lines

this is.

My lines

THIS LINE NUMBER 3

3 THIS LINE

this is.
then 'awk' can be used to produce the output
Quote:
this is.
this is my lines this is My lines this is my lines this is.
My lines this is.
If this is what you want, then the problem breaks down to:
1. Skip blank lines
2. Skip a line starting with a number
3. Skip a line ending with a number
4. If a line does not end with a period, concatenate with any previous input
5. If a line ends with a period, concatenate with any previous input and print.
This can be done with an awk script containing appropriate /regular expression/ patterns and actions.
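A minimal sketch of those five steps, not necessarily the script used to produce the output above. The function name join_sentences is hypothetical, and the extra first rule (dropping a trailing CR) is an added assumption based on the file having come from Windows.

```shell
# Assumption: any POSIX awk; join_sentences is a hypothetical helper name.
join_sentences() {
  awk '
    { sub(/\r$/, "") }                          # drop a trailing CR, if any
    /^[ \t]*$/ { next }                         # 1. skip blank lines
    /^[0-9]/   { next }                         # 2. skip a line starting with a number
    /[0-9]$/   { next }                         # 3. skip a line ending with a number
    { buf = (buf == "" ? $0 : buf " " $0) }     # 4. concatenate with previous input
    /\.$/ { print buf; buf = "" }               # 5. ends with a period: print
    END { if (buf != "") print buf }            # flush text that never hit a period
  '
}

printf '%s\n' 'this is.' '' 'this is' '' 'My lines' '' \
  'THIS LINE NUMBER 3' '' '3 THIS LINE' '' 'this is.' | join_sentences
```

Used as join_sentences < file.txt. Steps 2 and 3 here skip any line that starts or ends with a digit, whether or not it is in capitals, exactly as the list states.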
 
Old 05-01-2016, 11:41 AM   #10
DavidMcCann
Senior Member
 
Registered: Jul 2006
Location: London
Distribution: CentOS, Salix
Posts: 4,165

Rep: Reputation: 1223
If it is a Windows text file, the problem is simply that it contains both line feeds and unwanted carriage returns. In that case, loading into an editor may give you the opportunity to ignore the CRs.
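On the command line, the unwanted carriage returns can also be removed by deleting them outright. A small sketch, with strip_cr as an invented helper name:

```shell
# strip_cr is a hypothetical helper name: it deletes every CR byte from stdin.
strip_cr() {
  tr -d '\r'
}

# Show the bytes before and after, to confirm that only LF line endings remain.
printf 'this is\r\nmy lines\r\n' | od -c
printf 'this is\r\nmy lines\r\n' | strip_cr | od -c
```

Deleting the CR is used here rather than translating it: with GNU tr, a second set shorter than the first is padded with its last character, so a call like tr '\r\n' '\n' would map each CR to an LF and double every line ending.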
 
Old 05-01-2016, 01:37 PM   #11
hurd
LQ Newbie
 
Registered: Jan 2015
Posts: 12

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00 View Post
Not a good idea to go using commands you don't understand. Here is the online documentation. Time spent reading it will be worthwhile. Have a look at the "s" command and "Other commands" sections to work out what the command you cited in the original post does. Then you can modify it to suit what you want to do now.
Hi,

You're right, but I spend too much time learning something new each time, even for something that I will use only once in my life, so much that I do not have enough time to live normally :s (no kidding, it is obsessive).

So I restrain myself (with great difficulty) from learning things that will not benefit me for a long time.

But perhaps I should learn a little more sed.

In the end, that is what I did (not enough, I have to admit), and I have almost finished everything I need to do.

1. I cleaned the mismatched line endings from the files (CR+LF to LF)
Code:
# delete the CRs; tr '\r\n' '\n' would turn each CR into an extra LF instead
for i in *.htm ; do tr -d '\r' < "$i" > "tr-$i";  done;
(Then I removed the original, untouched files from the working directory)

2. Then join lines (as needed)
Code:
for i in *.htm ; do sed -r ':a;N;$!ba;s/\n([^A-Z])/ \1/g' "$i" ;  done;
3. Put <p> at the beginning and </p> at the end of each sentence starting with an uppercase letter, which can be followed by lowercase letters, symbols and numbers (to match the text):
Code:
for i in *.htm ; do sed -i -r 's/.*[A-Z0-9(]+*[a-z0-9].*[a-z.:? ]$/\<p\>&/' $i ;  done;
for i in *.htm ; do sed -i -r '/^<p>/ s/$/\<\/p\>/' $i ;  done;
4. Then enclose each title (words in capitals) in h2 tags
Code:
sed -r 's/^.*[A-Z0-9].*[A-Z0-9]$/\<h2\>&/'
sed -r '/^<h2>/ s/$/\<\/h2\>/'
I need to do this because I took the content from the Internet (from OCR; these files seem to have been formatted on a Windows OS). It comes from scans of very rare, interesting, and "pretty" old books (maybe only available in India today).

In fact, I want to make an ebook from these files.

Now I only need to find some way to clean up a few things, but it should be OK.

Finally, there is a long manual job of cleaning up some OCR mistakes.

Thank you.
 
Old 05-01-2016, 02:29 PM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,252

Rep: Reputation: 2685
Quote:
I spend too much time learning something new each time, even for something that I will use only once in my life
I learnt how to add once in my life ... funnily enough I still need it now and then. This is not to say learning arithmetic is on the same level for life as learning sed, but to say you will never use it again always makes me laugh.

I would add that if you had told us a little more about what you had and what you needed, there may have been those on this site who could have made better suggestions than solving an issue that is not really related. I am not trying to pick on you specifically, but this is a good question to point out that if you do not supply the correct information you will not get very useful answers.

Glad you found a solution. You might also wish to look up the dos2unix command for future use when converting Windows-based text files.

I would finish by saying that awk (or a higher-level language) could easily have performed all your tasks in a single script.
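As a rough illustration of the single-script idea, here is a sketch only. The rules are assumptions pieced together from this thread: a capitals-only line that starts or ends with a digit is dropped, any other capitals-only line becomes an h2 title, and everything else is joined into a paragraph until a line ends with a period, colon or question mark. The names to_html and emit are invented.

```shell
# Assumption: any POSIX awk; to_html and emit are hypothetical names.
to_html() {
  awk '
    { sub(/\r$/, "") }                   # drop a trailing CR, if any
    /^[ \t]*$/ { next }                  # skip blank lines
    $0 !~ /[a-z]/ {                      # line without lowercase letters
      if ($0 ~ /^[0-9]/ || $0 ~ /[0-9]$/) next   # starts or ends with a digit: drop
      emit()                             # close any open paragraph
      print "<h2>" $0 "</h2>"            # treat the line as a title
      next
    }
    { buf = (buf == "" ? $0 : buf " " $0) }      # join into the current paragraph
    /[.:?]$/ { emit() }                  # sentence end closes the paragraph
    END { emit() }                       # flush whatever is left
    function emit() {
      if (buf != "") { print "<p>" buf "</p>"; buf = "" }
    }
  '
}

printf '%s\n' 'A TITLE' '' 'this is' '' 'my lines.' '' \
  'THIS LINE NUMBER 3' '' 'Next one' | to_html
```

For a directory of files this could be run as for i in *.htm ; do to_html < "$i" > "out-$i" ; done. Real OCR text will contain cases these rules get wrong, so the result still needs checking by hand.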
 
  


Tags
awk, joining, line, lowercase, sed


All times are GMT -5.
