[SOLVED] Strip unwanted carriage returns

business_kid · 04-25-2016, 07:54 AM

A simple (apparently) problem. I want to strip carriage returns (0x0a) from plain text. I also need to take out column breaks & form feeds on some things. This fails to do it:

Code:

sed 's/\x0a/\x20/g' -i some_file.txt

But the instruction format works for other unwanted hex characters in the text. I am not familiar with awk or perl. I have vim installed, & nano, Sigil, Calibre & Libreoffice if magic there will work. I don't understand why sed does not.

Didier Spaier · 04-25-2016, 07:59 AM

fromdos --help

EDIT But anyway carriage return (CR) is not 0A (that is line feed or LF) but 0D, see for instance http://www.utf8-chartable.de/unicode...ble.pl?names=2
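To check which of the two a file actually contains, a byte dump helps. A minimal sketch, assuming a POSIX shell with od, tr and wc; sample.txt is a made-up file name used only for illustration:

```shell
# Build a small sample with one DOS (CRLF) line and one UNIX (LF) line.
# sample.txt is a hypothetical name used only for this illustration.
printf 'dos line\r\nunix line\n' > sample.txt

# od -c prints CR as \r and LF as \n, so the two are easy to tell apart.
od -c sample.txt

# Count each kind of byte: delete everything else, then count what is left.
cr_count=$(tr -cd '\r' < sample.txt | wc -c)
lf_count=$(tr -cd '\n' < sample.txt | wc -c)
echo "CR (0x0d): $((cr_count))  LF (0x0a): $((lf_count))"
```

If the CR count is zero, the file has plain UNIX line endings and the bytes in question are line feeds.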

Also I don't know any character named "column break". What is that?

Paulo2 · 04-25-2016, 08:32 AM

Isn't carriage-return 0x0d?
As Didier pointed out, it is a DOS thing.

Code:

echo -e '\x0d'|cat -A
^M$
echo -e '\x0a'|cat -A
$
$

55020 · 04-25-2016, 09:04 AM

I suspect your sed substitution just needs '-e', then you can zap any other stuff like FF in the same command.
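For reference, a sketch of what chaining several expressions with -e looks like, assuming GNU sed (the \x0c and \r escapes are GNU extensions). One caveat: sed removes the trailing newline from each line before the expressions run, so a 0x0a pattern has nothing to match inside a line, with or without -e; that part needs one of the approaches further down the thread. page.txt is a made-up file name.

```shell
# page.txt is a hypothetical sample: a form feed, a CRLF line, an LF line.
printf 'one\ftwo\r\nthree\n' > page.txt

# Two expressions in one sed run (GNU sed escapes assumed):
#   1st: delete every form feed (0x0c)
#   2nd: delete a carriage return (0x0d) at the end of a line
cleaned=$(sed -e 's/\x0c//g' -e 's/\r$//' page.txt)
printf '%s\n' "$cleaned"
```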

aaazen · 04-25-2016, 09:53 AM

Here are some commands that might be helpful.

$ man ascii

$ hexdump -C [filename]

$ fromdos < [dostextfile] > [unixtextfile]

$ todos < [unixtextfile] > [dostextfile]

tronayne · 04-25-2016, 10:23 AM

Within vi, you can do the non-printing character trick to remove all carriage returns:

Open the text file that contains the return characters and enter

Code:

:g/Ctrl-VCtrl-M/s///g

Ctrl-V allows you to insert "control" characters in a file, Ctrl-M is the carriage return; you simply substitute the CR with nothing. Note that you do not place a space between the two control characters.

Here's a little cross reference of all of the control characters with their names:

Code:

	Dec	Hex	Octal	Binary		ASCII
	000	000	0000	00000000	NUL	(Ctrl-@)
	001	001	0001	00000001	SOH	(Ctrl-A)
	002	002	0002	00000010	STX	(Ctrl-B)
	003	003	0003	00000011	ETX	(Ctrl-C)
	004	004	0004	00000100	EOT	(Ctrl-D)
	005	005	0005	00000101	ENQ	(Ctrl-E)
	006	006	0006	00000110	ACK	(Ctrl-F)
	007	007	0007	00000111	BEL	(Ctrl-G)
	008	008	0010	00001000	BS	(Ctrl-H)
	009	009	0011	00001001	HT	(Ctrl-I)
	010	00a	0012	00001010	NL	(Ctrl-J)
	011	00b	0013	00001011	VT	(Ctrl-K)
	012	00c	0014	00001100	NP	(Ctrl-L)
	013	00d	0015	00001101	CR	(Ctrl-M)
	014	00e	0016	00001110	SO	(Ctrl-N)
	015	00f	0017	00001111	SI	(Ctrl-O)
	016	010	0020	00010000	DLE	(Ctrl-P)
	017	011	0021	00010001	DC1	(Ctrl-Q)
	018	012	0022	00010010	DC2	(Ctrl-R)
	019	013	0023	00010011	DC3	(Ctrl-S)
	020	014	0024	00010100	DC4	(Ctrl-T)
	021	015	0025	00010101	NAK	(Ctrl-U)
	022	016	0026	00010110	SYN	(Ctrl-V)
	023	017	0027	00010111	ETB	(Ctrl-W)
	024	018	0030	00011000	CAN	(Ctrl-X)
	025	019	0031	00011001	EM	(Ctrl-Y)
	026	01a	0032	00011010	SUB	(Ctrl-Z)
	027	01b	0033	00011011	ESC	(Ctrl-[)
	028	01c	0034	00011100	FS	(Ctrl-\)
	029	01d	0035	00011101	GS	(Ctrl-])
	030	01e	0036	00011110	RS	(Ctrl-^)
	031	01f	0037	00011111	US	(Ctrl-_)
	032	020	0040	00100000	SP	(space bar; not a control character)
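The Ctrl column follows a simple rule for codes 0 to 31: the control code is the key's ASCII code with only the low five bits kept (code AND 0x1f). A small sketch in POSIX shell; ctrl_code is a name made up for this example:

```shell
# ctrl_code LETTER: print the decimal code that Ctrl-LETTER produces.
# A leading quote in a printf argument yields the character's numeric code.
ctrl_code() {
    code=$(printf '%d' "'$1")
    echo $(( code & 31 ))   # keep only the low five bits
}

ctrl_code M   # carriage return (CR in the table)
ctrl_code J   # line feed (NL in the table)
ctrl_code L   # form feed (NP in the table)
```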

You can use the same trick in the shell; e.g., if your screen becomes unreadable (with all sorts of goofy characters), which can happen if you cat a non-ASCII file, you can enter Ctrl-VCtrl-N (or Ctrl-VCtrl-O) to, hopefully, recover the screen (you may have shifted-in or shifted-out, which changes character sets).

Hope this helps some.

business_kid · 04-25-2016, 11:52 AM

Thanks, all of you, for the replies. To answer some points:

Carriage Returns could well be 0x0d; I opened the file with xxd & 0x0a is the offending character. If it's not <CR>, that's my bad.
I know about fromdos & todos.

The vim thing looks great for Carriage returns - what's the magic for a line feed?
EDIT: Oh I see it. Ctrl-J. Thanks, tronayne

I'll get back to work tomorrow and try all those suggestions.

fsbooks · 04-28-2016, 05:58 AM

Being a linefeed editor, I don't think sed can remove new lines. I believe it takes the line, defined by a '\n' at the end (NL, \x0a, etc.). It then returns the line, changed by the expression(s), with a newline at the end. I would think that, by definition, the material to be edited would not contain a newline, as that would define a new line. For your usage, I'd use tr for a simple solution.

Code:

$ cat ttt
aaa
bbbb
ccccc
$ <ttt od -h
0000000 6161 0a61 6262 6262 630a 6363 6363 000a
0000017
$ <ttt >ttt2 tr '\n' ' '
$ <ttt2 od -h
0000000 6161 2061 6262 6262 6320 6363 6363 0020
0000017
$ cat ttt2
aaa bbbb ccccc $ wc -l ttt2
0 ttt2

Note that the od dump shows the 0a's replaced with 20's, a cat of the file leaves the prompt on the same line, and wc reports 0 lines (the trailing NL has also been replaced with a space).
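If the final newline should survive, paste is one option. A sketch assuming a POSIX paste; ttt is recreated here with the same three lines as in the post, and ttt3 is a made-up output name:

```shell
# Recreate the three-line sample file from the post.
printf 'aaa\nbbbb\nccccc\n' > ttt

# -s joins all lines of the file into one; -d ' ' puts a space between them.
# Unlike the tr approach, paste still ends the output with a single newline.
paste -s -d ' ' ttt > ttt3
cat ttt3
wc -l < ttt3
```

With this form the shell prompt comes back on its own line and wc counts one line.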

Didier Spaier · 04-28-2016, 06:54 AM

Quote:

Originally Posted by fsbooks

Being a linefeed editor, I don't think sed can remove new lines.

It can.

Code:

/tmp$ sed ":a;N;s/\n/ /;ba" ttt
aaa bbbb ccccc
/tmp$
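For anyone puzzled by that one-liner, here is a sketch of a closely related and commonly seen form, split into separate -e expressions with comments. It is a variant, not the exact command above: it gathers all lines first and substitutes once at the end, so the whole file is held in memory. ttt is the same sample file as in the earlier post.

```shell
# Recreate the three-line sample file.
printf 'aaa\nbbbb\nccccc\n' > ttt

# :a        define a label named a
# N         append the next input line to the pattern space, with a \n between
# $!ba      if this is not the last line, branch back to label a
# s/\n/ /g  once everything is gathered, turn each embedded newline into a space
joined=$(sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' ttt)
printf '%s\n' "$joined"
```

The newline becomes visible to the s command only because N pulls the following line into the pattern space; on a single line there is never a newline to match.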

lazydog · 04-28-2016, 10:55 AM

I use dos2unix and it cleans up the files imported from Windows nicely.

vonbiber · 04-29-2016, 12:32 AM

I'm surprised nobody mentioned tr.

Code:

$ tr -d '\r' < input.txt > output.txt

GazL · 04-29-2016, 09:05 AM

The problem with that tr is that it will remove all CRs not just the CR on a CRLF. It's unlikely that you'll often encounter a solitary CR but the possibility is there.

I have used sed -i 's/\r$//' before, but as someone recently pointed out to me, '\r' is a GNU extension and not in the POSIX sed specification, which is something that some people care about.
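One way around the '\r' portability issue is to let printf produce the carriage return and hand sed the literal character. A sketch, assuming a POSIX shell; crlf.txt and lf.txt are made-up file names:

```shell
# Put a literal carriage return into a variable; printf understands \r
# even where sed does not.
cr=$(printf '\r')

# crlf.txt is a hypothetical sample: CRLF endings plus a lone CR mid-line.
printf 'first\r\nmid\rdle\r\n' > crlf.txt

# Delete a CR only when it is the last character of a line.
# \$ passes a literal $ (end-of-line anchor) through the double quotes.
sed "s/${cr}\$//" crlf.txt > lf.txt
od -c lf.txt
```

The lone CR in the middle of the second line is left alone, which is the difference from tr -d '\r'.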

vonbiber · 04-30-2016, 01:14 AM

Quote:

Originally Posted by GazL

The problem with that tr is that it will remove all CRs not just the CR on a CRLF. It's unlikely that you'll often encounter a solitary CR but the possibility is there.

Yes, you're right.
Actually single CRs used (?) to be MacOS's way of terminating a line.
I wonder if they switched to LFs since they started using BSD.
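For the record, classic Mac OS did end lines with a lone CR, and Mac OS X uses LF. For old CR-only files the CR has to be translated rather than deleted, otherwise all the lines run together. A sketch, with oldmac.txt and unix.txt as made-up file names:

```shell
# oldmac.txt is a hypothetical CR-terminated file (no LF anywhere).
printf 'line one\rline two\r' > oldmac.txt

# Translate each CR into an LF instead of deleting it.
tr '\r' '\n' < oldmac.txt > unix.txt
wc -l < unix.txt
```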

Didier Spaier · 04-30-2016, 02:39 AM

Quote:

Originally Posted by GazL

The problem with that tr is that it will remove all CRs not just the CR on a CRLF. It's unlikely that you'll often encounter a solitary CR but the possibility is there.

Just to clarify: like other utilities such as sed, tr deals with characters, not bytes.

In the most commonly used character encodings (ASCII and UTF-8), CR is represented by a single byte, but by two bytes in UCS-2 and by four in UCS-4, aka UTF-32.
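To see the byte counts for yourself, here is a sketch that assumes iconv is installed. The second half assumes a tr that works on single bytes, as GNU tr does; other implementations may behave differently.

```shell
# Encode "a", CR, LF as UTF-16LE and dump the bytes one by one.
utf16=$(printf 'a\r\n' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1)
echo $utf16

# The same text after a byte-wise tr -d '\r': the 0d byte is gone but its
# 00 partner is left behind, so the result is no longer valid UTF-16.
broken=$(printf 'a\r\n' | iconv -f UTF-8 -t UTF-16LE | tr -d '\r' | od -An -tx1)
echo $broken
```

In other words, a file in a multi-byte encoding is better converted to UTF-8 first and cleaned up afterwards.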

Also, I assume that we are speaking of text files, not binary files.

tronayne · 04-30-2016, 06:08 AM

I've never had a failure, stripping the carriage returns from Windows text files, with this little utility, dos2unx:

Code:

#!/bin/sh
#
# dos2unx file [file...]
#
# Converts text files (names specified on command line) from MS-DOS
# format to UNIX format.  Essentially, gets rid of all carriage
# returns (\r), since line feeds (\n) are all UNIX needs.

if [ $# -lt 1 ]
then
        echo "usage: dos2unx file [file ...]" >&2
        exit 1
fi

for FILE
do
        # printf instead of echo -n, which not every /bin/sh supports
        printf 'dos2unx: converting %s ... ' "${FILE}"
        # Strip every CR into a temporary file, then put it in place
        tr -d '\r' < "${FILE}" > /tmp/conv$$
        rm -f "${FILE}"
        cp -f /tmp/conv$$ "${FILE}"
        rm -f /tmp/conv$$
        echo "done"
done

Just save this as dos2unx.sh and

Code:

make dos2unx
mv dos2unx /usr/local/bin

Works just fine (and /usr/local/bin is on your PATH).
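A slightly more defensive variant of the same idea, offered only as a sketch: it assumes mktemp is available (it is not part of POSIX), and strip_cr is a name made up here. It writes the result back with cat so the original file keeps its permissions, and it leaves the original untouched if tr fails.

```shell
# strip_cr FILE...: remove every carriage return from each named file.
strip_cr() {
    for f
    do
        tmp=$(mktemp) || return 1
        # Only overwrite the original once tr has finished successfully.
        if tr -d '\r' < "$f" > "$tmp"
        then
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}

# Example run on a hypothetical sample file with a space in its name.
printf 'one\r\ntwo\r\n' > "dos sample.txt"
strip_cr "dos sample.txt"
od -c "dos sample.txt"
```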

Hope this helps some.