LinuxQuestions.org


hurd 04-30-2016 04:18 PM

Joining line ending with lowercase and starting with lowercase, or uppercase
 
Hi,

Firstly,
I would like to join a line ending with a lowercase letter to a line starting with a lowercase letter when there is a blank line between them, like this:

Quote:

this is

my lines
I would like to get that:

Quote:

this is my lines
Secondly,
I would like to join a line ending with a lowercase letter to a line starting with an uppercase letter when there is a blank line between them and no period at the end of the first line, like this:

Quote:

this is

My lines
I would like to get that:

Quote:

this is My lines

I'm not a pro with sed, but I use it sometimes (for easy stuff). I tried something like this (found on the net) for cases where there is no blank line in between:
Code:

sed -r ':a;N;$!ba;s/\n([^A-Z])/ \1/g' file.txt
I do not understand everything in the above command, but I think I can reuse it for getting the result I need.
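One possible way to reuse that command for the blank-line case is sketched below. It assumes GNU sed and a file that already has LF-only line endings; the function name `join_across_blank` is made up for illustration. The substitution only fires when the first line ends in a lowercase letter, so a line ending in a period is left alone.

```shell
# Hypothetical helper: join a line ending in a lowercase letter with the next
# text line when exactly one blank line separates them.
# Assumes GNU sed and LF-only line endings.
join_across_blank() {
  LC_ALL=C sed -E ':a;N;$!ba;s/([a-z])\n\n([A-Za-z])/\1 \2/g' "$@"
}

# Demo on the example from this post.
printf 'this is\n\nMy lines\n' | join_across_blank
```

The same call handles both the lowercase and the uppercase start, because the second bracket expression accepts either.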

Thirdly,
How can I delete each line written in capitals that starts or ends with a number?
Thanks for any help.

tshikose 04-30-2016 05:10 PM

Hi,

So actually, it makes no difference whether the next line starts with lower case or upper case?

tshikose 04-30-2016 05:11 PM

Or you want

Quote:

This is

My lines
to become
Quote:

This is
My lines
???

hurd 05-01-2016 01:08 AM

No, it makes no difference whether the next line starts with lower case or upper case.

And I really want to join these lines, like I said in the first post.

Thank you.

syg00 05-01-2016 02:42 AM

Not a good idea to go using commands you don't understand. Here is the online documentation. Time spent reading it will be worthwhile. Have a look at the "s" command and "Other commands" sections to work out what the command you cited in the original post does. Then you can modify it to suit what you want to do now.
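As a reading aid, the one-liner from the first post can be spread over several lines with a comment per command. This is only a sketch assuming GNU sed; the wrapper name `join_non_upper` is hypothetical.

```shell
# Hypothetical wrapper around the one-liner from the first post (GNU sed).
join_non_upper() {
  sed -E '
    # Define a label named "a" to jump back to.
    :a
    # Append the next input line to the pattern space, separated by a newline.
    N
    # If this is not the last line ($!), branch back to label a.
    $!ba
    # The whole file is now in the pattern space: replace each newline that is
    # followed by a non-uppercase character with a space, keeping the character.
    s/\n([^A-Z])/ \1/g
  ' "$@"
}

# Demo: the second line starts with lowercase and is joined, the third is not.
printf 'this is\nmy lines\nMy lines\n' | join_non_upper
```

Note that `[^A-Z]` also matches a newline, so when a blank line sits between two text lines the first newline is turned into a space and the lines stay apart.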

hurd 05-01-2016 02:44 AM

OK for lower case
 
Well, I succeeded for lower case.

I had CR characters in my original files; I converted them to LF, and now it works for lower case:
Code:

sed -r ':a;N;$!ba;s/\n\n([^A-Z])/ \1/g' file.txt
But two questions remain:
1) How do I join lines ending in lower case with lines starting with upper case, where there is no period between them?

2) How can I delete each line written in capital letters that starts or ends with a number?
Like this:
Quote:

THIS LINE NUMBER 3
or
Quote:

3 THIS LINE
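For the second question, a minimal sketch is shown here. It assumes that "in capital letters" can be read as "contains no lowercase letter", and the function name `drop_numbered_caps` is made up for illustration. `LC_ALL=C` is set so that the `a-z` range means plain ASCII lowercase.

```shell
# Hypothetical helper: delete lines that contain no lowercase letter and
# either start or end with a digit.
drop_numbered_caps() {
  LC_ALL=C sed -E '/^[0-9][^a-z]*$/d; /^[^a-z]*[0-9]$/d' "$@"
}

# Demo: only the two mixed-case lines survive.
printf 'this is\nTHIS LINE NUMBER 3\n3 THIS LINE\nMy lines\n' | drop_numbered_caps
```

Under that assumption a line made only of digits is deleted as well, which may or may not be wanted.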
Thank you.

grail 05-01-2016 03:39 AM

You have a few questions you need to ask yourself as some of your data is now overlapping.
Code:

this is

THIS LINE NUMBER 3

With the above example, your first command will want to join the lines. But do you then delete the entire joined line because it has capitals and a number? Or do you delete the capitalised line before joining, so you just end up with 'this is'? If so, what happens to the extra blank line that was in between?

I agree with syg00 that you should first go and read what the sed line you have does and also work out if you need multiple sed's piped together or in a single script???
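One possible answer to the ordering question, as a sketch only: delete the capitalised lines first in one sed, then join in a second sed that tolerates any number of newlines between the two text lines. This assumes GNU sed, LF-only input, and the reading that "capitals" means "no lowercase letter"; the name `clean_then_join` is hypothetical.

```shell
# Hypothetical pipeline: first drop all-capital lines that start or end with
# a digit, then join lines across whatever run of newlines is left behind.
clean_then_join() {
  LC_ALL=C sed -E '/^[0-9][^a-z]*$/d; /^[^a-z]*[0-9]$/d' "$@" |
    LC_ALL=C sed -E ':a;N;$!ba;s/([a-z])\n+([A-Za-z])/\1 \2/g'
}

# Demo on the overlapping example, with a following text line added.
printf 'this is\n\nTHIS LINE NUMBER 3\n\nmy lines\n' | clean_then_join
```

Because the join uses `\n+`, it also joins lines that are separated by a single newline, so it only suits a file that is double spaced throughout.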

AwesomeMachine 05-01-2016 05:17 AM

Try using tr.

allend 05-01-2016 10:02 AM

It appears that you are trying to parse a double spaced text file prepared in Windows.
Your needs are still unclear to me, but perhaps 'sed' is the wrong tool for the task.
Given an input file
Quote:

this is.

this is

my lines

this is

My lines

THIS LINE NUMBER 3

3 THIS LINE

this is

my lines

this is.

My lines

THIS LINE NUMBER 3

3 THIS LINE

this is.
then 'awk' can be used to produce the output
Quote:

this is.
this is my lines this is My lines this is my lines this is.
My lines this is.
If this is what you want, then the problem breaks down to:
1. Skip blank lines
2. Skip a line starting with a number
3. Skip a line ending with a number
4. If a line does not end with a period, concatenate with any previous input
5. If a line ends with a period, concatenate with any previous input and print.
This can be done with an awk script containing appropriate /regular expression/ patterns and actions.
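The five steps above could be sketched roughly as follows. The function name `join_sentences` and the variable `buf` are made up for illustration, and stripping a trailing CR is an added assumption for Windows-style input rather than one of the listed steps.

```shell
# Hypothetical helper implementing the five steps with awk.
join_sentences() {
  awk '
    { sub(/\r$/, "") }                        # assumption: drop a trailing CR
    /^[[:space:]]*$/ { next }                 # 1. skip blank lines
    /^[0-9]/ { next }                         # 2. skip a line starting with a number
    /[0-9]$/ { next }                         # 3. skip a line ending with a number
    { buf = (buf == "" ? $0 : buf " " $0) }   # 4. concatenate with previous input
    /\.$/ { print buf; buf = "" }             # 5. line ends with a period: print
    END { if (buf != "") print buf }          # flush text with no final period
  ' "$@"
}

# Demo on a short double-spaced sample.
printf '%s\n\n' 'this is' 'my lines' '3 THIS LINE' 'this is.' | join_sentences
```

Run on the sample input file quoted above, this should reproduce the three output lines shown.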

DavidMcCann 05-01-2016 10:41 AM

If it is a Windows text file, the problem is simply that it contains both line feeds and unwanted carriage returns. In that case, loading into an editor may give you the opportunity to ignore the CRs.
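Outside an editor, the carriage returns can also be removed on the command line. A small sketch, with the hypothetical name `strip_cr`, using `tr -d`:

```shell
# Hypothetical helper: delete every carriage return from standard input,
# turning CR+LF line endings into plain LF.
strip_cr() {
  tr -d '\r'
}

# Demo: two CR+LF terminated lines come out LF terminated.
printf 'this is\r\nmy lines\r\n' | strip_cr
```

Deleting is used rather than translating because `tr` works character by character: mapping `\r` to `\n` would turn each CR+LF pair into two line feeds, which shows up as an extra blank line after every line.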

hurd 05-01-2016 12:37 PM

Quote:

Originally Posted by syg00 (Post 5538909)
Not a good idea to go using commands you don't understand. Here is the online documentation. Time spent reading it will be worthwhile. Have a look at the "s" command and "Other commands" sections to work out what the command you cited in the original post does. Then you can modify it to suit what you want to do now.

Hi,

You're right, but I spend too much time learning something new each time, even for something I will use only once in my life, so that I do not have enough time to live normally :s (no kidding, it is obsessive).

So I hold myself back (with great difficulty) from learning things that will not benefit me for long.

But perhaps I should learn a little more sed.

In the end, that is what I did (not enough, I have to admit), and I have almost finished everything I need to do.

1. I cleaned up the line endings in the files (CR+LF to LF):
Code:

for i in *.htm ; do tr -d '\r' < "$i" > "tr-$i" ; done
(Then I removed the first untouched files from the working directory)

2. Then join lines (as needed)
Code:

for i in *.htm ; do sed -i -r ':a;N;$!ba;s/\n([^A-Z])/ \1/g' "$i" ; done
3. Put <p> at the beginning and </p> at the end of each sentence that starts with an uppercase letter and may be followed by lowercase letters, symbols, and numbers (to match the body text):
Code:

for i in *.htm ; do sed -i -r 's/.*[A-Z0-9(]+*[a-z0-9].*[a-z.:? ]$/\<p\>&/' $i ;  done;
for i in *.htm ; do sed -i -r '/^<p>/ s/$/\<\/p\>/' $i ;  done;

4. Then enclose each title (words in capitals) in h2 tags:
Code:

sed -r 's/^.*[A-Z0-9].*[A-Z0-9]$/\<h2\>&/'
sed -r '/^<h2>/ s/$/\<\/h2\>/'
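The opening and closing tags can also be added in one substitution per line, since `&` in the replacement stands for the whole match. The sketch below is a simplified variant, not the commands above: it assumes a title is any line with at least one capital and no lowercase letter, and that every other non-blank line is body text. The name `tag_lines` is hypothetical.

```shell
# Hypothetical helper: wrap titles in <h2> and body lines in <p> in one pass.
tag_lines() {
  LC_ALL=C sed -E '
    # Leave blank lines untouched.
    /^$/b
    # No lowercase letter on the line: tag it as a title if it has a capital.
    /[a-z]/!{
      /[A-Z]/s|.*|<h2>&</h2>|
      b
    }
    # Everything else is treated as body text.
    s|.*|<p>&</p>|
  ' "$@"
}

# Demo: one title, one blank line, one sentence.
printf 'CHAPTER 1\n\nThis is my line.\n' | tag_lines
```

Using `|` as the delimiter of the `s` command avoids having to escape the slash in the closing tags.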

I need to do this because I took content from the Internet (from OCR; these files seem to have been formatted on a Windows OS), taken from scans of very rare, interesting, and "pretty" old books (perhaps only available in India today).

In fact, I want to make an ebook from these files.

Now I only need to find some way to clean up a few things, but it should be OK.

Finally, there is a long manual job of cleaning up some OCR mistakes.

Thank you.

grail 05-01-2016 01:29 PM

Quote:

I spend too much time learning something new each time, even for something that I will use only once in my life
I learnt how to add once in my life ... funnily enough I still need it now and then. This is not to say that learning arithmetic is on the same level for life as learning sed, but saying you will never use it again always makes me laugh.

I would add that if you had told us a little more about what you had and what you needed, there may have been people on this site who could have made better suggestions than solving an issue that is not really related. I am not trying to pick on you specifically, but this is a good question to point out that if you do not supply the correct information you will not get very useful answers.

Glad you found a solution :) You might also wish to look up dos2unix command for future use on altering Windows based text files.

I would finish by saying that awk (or a higher level language) could easily have performed all your tasks in a single script.


All times are GMT -5.