LinuxQuestions.org


cxny 08-15-2011 03:56 PM

using sed to trim lines greater than maximum number of characters
 
Hi all,

I'm new to this, so my knowledge is very limited and I need a lot of help.

I need to change a single line of text to contain 52 characters or less.
*All lines have at least 2 chars. in them; there are no blank ones.

I need to only use GNU sed for this because I want to continue the following mess:

Code:

s/[^A-Za-z0-9_ `@#$+-=,.'(){}]//g;s/  */ /g;s/^ *//;s/ *$//
1st part keeps only certain chars.,i.e. no *!"<>[], etc.
2nd part changes multiple spaces to just one. (The pattern in 's/  */ /g' is 2 spaces followed by *, and the replacement is 1 space.)
3rd part gets rid of leading spaces.
4th part gets rid of trailing spaces.

Everything works but I'm missing the 5th part to trim the result to 52 chars. or less which must be done last. Or actually, I should probably trim trailing spaces last 'cause I can't have 'em.

btw, if there's a way I can better combine all this stuff, don't hesitate to tell me!

Thanks in advance!

crts 08-15-2011 05:36 PM

Hi and welcome to LQ,

try this to trim the line:
Code:

sed 's/\(.\{,52\}\).*/\1/'
Let me know if you have trouble incorporating it into the solution you have got so far.
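
A quick way to see what that expression does, as a minimal sketch assuming GNU sed (the `\{,52\}` form with no lower bound is accepted by GNU sed as `\{0,52\}`) and a POSIX shell. The `trim52` name and the sample lines are made up for illustration:

```shell
# trim52: hypothetical helper wrapping the expression above (GNU sed assumed).
trim52() {
  sed 's/\(.\{,52\}\).*/\1/'
}

# A 60-character test line: "0123456789" repeated six times.
long=$(printf '0123456789%.0s' 1 2 3 4 5 6)

out=$(printf '%s\n' "$long" | trim52)
echo "${#out}"                          # prints 52

# Lines already under the limit pass through unchanged.
printf '%s\n' 'short line' | trim52     # prints short line
```

The greedy `.\{,52\}` grabs as many characters as it can up to 52, and the trailing `.*` swallows whatever is left so the replacement keeps only the captured part.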

grail 08-15-2011 11:07 PM

Well, I would probably add that you could compare your exclusion list with your inclusion list and see which is shorter. Also, and yes, I read the part about ONLY sed, but it's worth mentioning that awk could handle a few things for you and leave you less to change, namely the handling of multiple spaces and of leading and trailing whitespace (just a suggestion).

cxny 08-16-2011 12:44 AM

Quote:

Originally Posted by crts (Post 4443920)
Hi and welcome to LQ,

try this to trim the line:
Code:

sed 's/\(.\{,52\}\).*/\1/'
Let me know if you have trouble incorporating it into the solution you have got so far.

Whoa, wasn't expecting such a quick response. It worked perfectly!

I ended up putting your part 4th as I thought I would have to. So now the meat of it looks like this:
Code:

s/[^A-Za-z0-9_ `',;@#$+-=}{]//g;s/  */ /g;s/^ *//;s/\(.\{,52\}\).*/\1/;s/ *$//
btw, I Googled and searched this site for an answer before posting but found nothing too helpful.
Is there a "real" help manual for sed anywhere, or is this just a Regex thing?

Anyway, thanks a lot!
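
On the ordering in the combined command above: the cut can land right after a space, which is why the trailing-space step has to come last. A small sketch of the difference, assuming GNU sed. The keep-list is shortened and the limit is lowered from 52 to 10 purely to keep the example readable, and the function names and sample text are invented:

```shell
# Shortened keep-list and a limit of 10 instead of 52: both are
# assumptions made only for this illustration.
clean='s/[^A-Za-z0-9_ .-]//g;s/  */ /g;s/^ *//'

# Trim to the limit first, then strip trailing spaces (the order used above).
trim_then_strip() {
  sed "$clean"';s/\(.\{,10\}\).*/\1/;s/ *$//'
}

# Strip trailing spaces first, then trim to the limit.
strip_then_trim() {
  sed "$clean"';s/ *$//;s/\(.\{,10\}\).*/\1/'
}

sample='  some   text!!  here  '
printf '[%s]\n' "$(printf '%s\n' "$sample" | trim_then_strip)"   # prints [some text]
printf '[%s]\n' "$(printf '%s\n' "$sample" | strip_then_trim)"   # prints [some text ]
```

With the second order the tenth character is a space, so the result ends in a blank even though trailing spaces were already removed once.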

grail 08-16-2011 01:01 AM

I would say the bulk of this is probably regex, but the following is a fairly good resource anyway:

http://www.grymoire.com/Unix/Sed.html

cxny 08-16-2011 01:13 AM

@grail:

Thanks for the suggestion but I'm actually inserting this into an existing Windows batch script of all things (using GNU sed for Windows) which is why I couldn't use anything else.

This is used to rename files before they are processed by other existing scripts. Here's part of it:
Code:

...
if exist *.htm (
  :: generate new file names
  for /F "skip=5 tokens=5" %%S in ('dir /X *.htm') do (
    if "%%S" NEQ "free" (
      for /F "delims=" %%L in ('dir /B "%%S"') do (
        for /f "tokens=*" %%N in ('echo "%%~nL" ^| "%~dp0sed.exe" "s/[^A-Za-z0-9_ `',;@#$+-=}{]//g;s/  */ /g;s/^ *//;s/\(.\{,52\}\).*/\1/;s/ *$//"') do (
          ... save the new names ...
        )
      )
    )
  )
  ... rename the files ...
)
...

Ever try renaming files with poison chars. using a Windows batch?!:banghead:
Don't ask. :p

ShadowCat8 08-16-2011 08:21 PM

As a thought, depending on the version of RegEx you have available to you in your environment, for your first section, how about using something like:
Code:

s/[[:punct:]]//g
to strip all the punctuation characters from the line?

Now, I'm pretty sure that the syntax is correct, but, if it doesn't work in your situation, it should be close enough to give you an idea where to take it to clean up your script a bit more. For RegEx, there are a few POSIX character classes that you can use to help you get to what you are looking for faster:
Code:

[:digit:]        Only the digits 0 to 9.
[:alnum:]        Any alphanumeric character: 0 to 9, A to Z, or a to z.
[:alpha:]        Any alphabetic character: A to Z or a to z.
[:blank:]        Space and TAB characters only.
[:xdigit:]       Hexadecimal digits: 0-9, A-F, a-f.
[:punct:]        Punctuation symbols ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:print:]        Any printable character, including space.
[:space:]        Any whitespace character (space, TAB, NL, VT, FF, CR). Many regex flavors abbreviate this as \s.
[:graph:]        Any printable character except space.
[:upper:]        Any uppercase letter A to Z.
[:lower:]        Any lowercase letter a to z.
[:cntrl:]        Control characters: ASCII 0-31 and 127 (NUL, TAB, NL, VT, FF, CR, ESC, DEL, and so on).

And, as an example, to strip ALL punctuation and whitespace characters, you can use:
Code:

s/[[:punct:][:space:]]//g
HTH.
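
To make the classes concrete, here is a short sketch assuming GNU sed and the C locale (`LC_ALL=C`, so the classes cover ASCII only). The sample string is invented. Note that a class name only works inside a bracket expression, hence the doubled brackets:

```shell
sample='Hello, World! (v2.0) #tag'

# Strip punctuation only.
printf '%s\n' "$sample" | LC_ALL=C sed 's/[[:punct:]]//g'            # prints Hello World v20 tag

# Strip punctuation and whitespace together.
printf '%s\n' "$sample" | LC_ALL=C sed 's/[[:punct:][:space:]]//g'   # prints HelloWorldv20tag

# Negated form: keep only alphanumerics and blanks, drop everything else.
printf '%s\n' "$sample" | LC_ALL=C sed 's/[^[:alnum:][:blank:]]//g'  # prints Hello World v20 tag
```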

cxny 08-17-2011 03:34 PM

Unfortunately, they wanted to keep as many as possible, so I needed to remove just the 'poison' ones that are invalid for file names in Windows:
Code:

~!/\:?"<>|
And some were removed to not break other scripts, especially: ~!()

But thanks for the tip, ShadowCat8; it might come in handy someday.
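
One thing worth checking in the keep-list used earlier in the thread: inside a bracket expression, `+-=` is read as a range from `+` to `=`, which in ASCII also covers `, - . / : ; <` and the digits. So `/`, `:` and `<` are kept rather than removed. That is harmless for names that already exist on a Windows disk, since they cannot contain those characters, but putting `-` last in the list makes it a literal hyphen. A sketch, assuming GNU sed and the C locale, with a shortened keep-list and an invented sample:

```shell
sample='a/b:c<d>e|f-g'

# "+-=" is a range here, so / : < fall inside it and survive.
printf '%s\n' "$sample" | LC_ALL=C sed 's/[^A-Za-z0-9_ +-=]//g'   # prints a/b:c<def-g

# With "-" last it is a literal hyphen; only + = - are kept from that group.
printf '%s\n' "$sample" | LC_ALL=C sed 's/[^A-Za-z0-9_ +=-]//g'   # prints abcdef-g
```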

archtoad6 09-02-2011 08:15 PM

Why not remove just the "poison" ones like this:
Code:

sed 's,[~!/\:?"<>|],,g'
(Note: I was taught to use ',' as my std. delimiter)

Consider '+' rather than '*' in removing extra spaces:
Code:

sed -r 's, +, ,g'
If that last did its job, then the following will suffice for leading spaces:
Code:

sed 's,^ ,,'
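
A small sketch of the two suggestions together, assuming GNU sed (for `-r`); the sample line is invented. Once runs of spaces have been collapsed, at most one leading space can remain, so matching a single space at the start is enough:

```shell
sample='   too    many   spaces'

# Extended syntax: one or more spaces become exactly one.
collapsed=$(printf '%s\n' "$sample" | sed -r 's, +, ,g')
printf '[%s]\n' "$collapsed"                                    # prints [ too many spaces]

# Only one leading space is left, so a single-space match removes it.
printf '[%s]\n' "$(printf '%s\n' "$collapsed" | sed 's,^ ,,')"   # prints [too many spaces]
```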

cxny 09-03-2011 01:19 PM

Quote:

Originally Posted by archtoad6 (Post 4460079)
Why not remove just the "poison" ones like this:
Code:

sed 's,[~!/\:?"<>|],,g'
(Note: I was taught to use ',' as my std. delimiter)

The problem is that I cannot use the poison chars. within a Windows batch because it would break the script. For instance, the > would be interpreted as redirection, the | as a pipe, the ! just causes havoc, etc. Also, GNU sed for Windows needs parameters to be in quotes, i.e. sed "s/[^A-Za-z0-9...]", so I couldn't specify the quote symbol either. This is why I had to choose the 'NOT' method instead.

Quote:

Originally Posted by archtoad6 (Post 4460079)
Consider '+' rather than '*' in removing extra spaces:
Code:

sed -r 's, +, ,g'

Being that I wanted it all on one line, I found I couldn't specify the -r option in between commands.
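
For what it's worth, `-r` is a command-line switch that changes the syntax of the whole script, so it cannot be toggled between commands. Two ways around that, sketched under the assumption of GNU sed: use the GNU `\+` escape in the default syntax, or write every command in the one-liner in extended syntax. The character filter is left out for brevity, and the function names and sample text are invented:

```shell
sample='  a    b   c  '

# Default (basic) syntax: GNU sed accepts \+ for "one or more".
basic() {
  sed 's/ \+/ /g;s/^ //;s/\(.\{0,52\}\).*/\1/;s/ *$//'
}

# Extended syntax for the whole script: + ( ) { } are written unescaped.
extended() {
  sed -r 's/ +/ /g;s/^ //;s/(.{0,52}).*/\1/;s/ *$//'
}

printf '[%s]\n' "$(printf '%s\n' "$sample" | basic)"      # prints [a b c]
printf '[%s]\n' "$(printf '%s\n' "$sample" | extended)"   # prints [a b c]
```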

Quote:

Originally Posted by archtoad6 (Post 4460079)
If that last did its job, then the following will suffice for leading spaces:
Code:

sed 's,^ ,,'

Makes sense, you saved me a character! :cool:

Thanks for your input, much appreciated.

cxny 09-03-2011 01:25 PM

btw, my final code for this section turned out like this:
Code:

if exist *.htm (
        :: generate new filenames (remove bad chars.,superfluous spaces and limit filename size)
        for /F "skip=5 tokens=5" %%S in ('dir /X *.htm') do (
                if "%%S" NEQ "free" (
                        for /F "delims=" %%L in ('dir /B "%%S"') do (
                                for /f "tokens=*" %%N in ('echo "%%~nL" ^| "%~dp0sed.exe" "s/[^A-Za-z0-9_ `',;@#$+-=}{]//g;s/  */ /g;s/^ //;s/\(.\{,52\}\).*/\1/;s/ *$//"') do (
                                        CALL :RENAME_Files "%%S" "%%L" "%%N"
                                )
                        )
                )
        )
)



All times are GMT -5.