LinuxQuestions.org


andrew.comly 06-07-2017 10:42 AM

Bash: Trouble converting files from dos to unix format
 
TITLE
ABS Ch16-19DictLookupDef

INTRO
I am reading the Advanced Bash-Scripting Guide by Mendel Cooper. Chapter 16 of this eBook, "External Filters, Programs and Commands", contains Example 16-19, "Looking up definitions in Webster's 1913 Dictionary", which asks the reader to download Webster's 1913 Dictionary (first 100 pages).

I first tried this program, but it failed with the following error:

========================================================================
Problem 1: [[: not found
========================================================================
Code:

        ll /usr/share/dict/webster1913-dict.txt
        -rw-rw-r-- 1 a a 1.5M  3月 23  2012 /usr/share/dict/webster1913-dict.txt
       
        sh Ch16-19DictLookupDef2.sh Abbey
        Ch16-19DictLookupDef2.sh: 20: Ch16-19DictLookupDef2.sh: [[: not found
        1st parameter detected!

Workaround 1
I quickly fixed this specific error by
  1. changing all double brackets into single brackets
  2. changing the "Definition"
    FROM
    Code:

    Definition=$(fgrep -A $MAXCONTEXTLINES "$1 \\" "$dictfile")
    TO
    Code:

    Definition=$(fgrep -A $MAXCONTEXTLINES "$1" "$dictfile")
{SOLVED}


========================================================================
Problem 2: dictionary - non-ASCII characters
========================================================================
Before running this program, I then check /usr/share/dict/webster1913-dict.txt:
Code:

ll /usr/share/dict/webster1913-dict.txt
-rw-rw-r-- 1 a a 1.5M  3月 23  2012 /usr/share/dict/webster1913-dict.txt

I then cat it and look at some of its content. Strangely, it doesn't consist solely of ASCII characters. Examples taken from the first 66 lines:
  1. ½
  2. ØAb¶aÏca (?),
  3. AÏbac¶iÏnate

Workaround 2
I then fixed the content by hand for about four words, along with their definitions, and ran the program.

Code:

$ /usr/local/bin/practice/ABS/Ch16-19DictLookupDef3.sh Ape
1st parameter detected!
Ape (?), n. [AS. apa; akin to D. aap, OHG. affo, G. affe, Icel. api, Sw. apa, Dan. abe, W. epa.] 1. (Zo”l.) A quadrumanous mammal, esp. of the family Simiad‘, having teeth of the same number and form as in man, having teeth of the same number and form as in man, and possessing neither a tail nor cheek pouches. The name is applied esp. to species of the genus Hylobates, and is sometimes used as a general term for all Quadrumana. The higher forms, the gorilla, chimpanzee, and ourang, are often called anthropoid apes or man apes.
 The ape of the Old Testament was probably the rhesus monkey of India, and allied forms.

========================================================================

PREMISES
I guess I had to make these changes because of the comment on lines 9-10 of the script: "Convert it from DOS to UNIX format (with only LF at end of line) before using it with this script."

QUESTION
How do I convert it to UNIX format? I tried the dos2unix utility below, but it says that the dictionary file downloaded from Project Gutenberg is a binary file!?
Code:

        $ dos2unix webster1913-dict.txt
        dos2unix: Binary symbol 0x15 found at line 35
        dos2unix: Skipping binary file webster1913-dict.txt

Is there a better way of fixing this than my workaround above for Example 16-19, "Looking up definitions in Webster's 1913 Dictionary"?
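
One way to see why dos2unix calls the file binary is to count the bytes that fall outside plain text. The following is only a sketch: the helper names `count_high_bytes` and `count_control_bytes` and the three-line sample are made up for illustration, and the sample merely imitates the kind of bytes reported above (a byte above 0x7F and the control byte 0x15).

```shell
#!/bin/sh
# Count bytes >= 0x80, i.e. everything outside 7-bit ASCII.
count_high_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Count control bytes other than TAB, LF, and CR.
# dos2unix complains about bytes like these (0x15 in the report above).
count_control_bytes() {
    LC_ALL=C tr -cd '\000-\010\013\014\016-\037' < "$1" | wc -c | tr -d ' '
}

# Hypothetical sample: CRLF line endings, one high byte (octal 266),
# and one control byte (octal 025 = 0x15).
sample=$(mktemp)
printf 'plain line\r\nAb\266bot\r\nctl\025here\r\n' > "$sample"

echo "high bytes:    $(count_high_bytes "$sample")"
echo "control bytes: $(count_control_bytes "$sample")"
```

Running the same two functions on the real dictionary file would show how much of it is affected before deciding on a conversion.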

norobro 06-07-2017 12:02 PM

  1. Try the -f option of dos2unix https://waterlan.home.xs4all.nl/dos2....htm#f---force
  2. Load the file into vim then do:
    Code:

    :set ff=unix
    :w


rknichols 06-07-2017 12:16 PM

Quote:

Originally Posted by andrew.comly (Post 5720150)
I first tried this program, but it failed with the following error:

========================================================================
Problem 1: [[: not found
========================================================================
Code:

        ll /usr/share/dict/webster1913-dict.txt
        -rw-rw-r-- 1 a a 1.5M  3月 23  2012 /usr/share/dict/webster1913-dict.txt
       
        sh Ch16-19DictLookupDef2.sh Abbey
        Ch16-19DictLookupDef2.sh: 20: Ch16-19DictLookupDef2.sh: [[: not found
        1st parameter detected!


You need to run that with "bash", not "sh". When invoked with the name "sh", bash tries to mimic historical versions of sh that do not support "[[ ... ]]".
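
A minimal sketch of the portable alternative, assuming a Linux system where `readlink -f` is available; the function name `has_argument` is invented for illustration. On Debian and Ubuntu, `sh` is usually dash rather than bash, and the "script: 20: script: [[: not found" style of message resembles what dash prints, so it may be worth checking what `sh` resolves to on the machine in question.

```shell
#!/bin/sh
# POSIX-portable argument check: [ ... ] is specified by POSIX,
# while [[ ... ]] is an extension found in bash, ksh, and zsh.
has_argument() {
    if [ -z "$1" ]; then
        echo "no parameter"
    else
        echo "1st parameter detected!"
    fi
}

has_argument Abbey

# Show which program actually backs "sh" here (for example dash or bash).
readlink -f "$(command -v sh)"
```

A script that keeps `[[ ... ]]` should be started as `bash script.sh` or given a `#!/bin/bash` first line and run directly.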

Laserbeak 06-07-2017 04:22 PM

If you cat it and get that strange text, then it doesn't seem like that's a plain text file.


Usually to convert a MS-DOS text file to a UNIX text file, you'd just have to do something like this with it:

Code:

#!/usr/bin/perl
$/ = '';
$_ = <>;
s/\r\n/\n/gs;
print;

./abovecode < msdos.txt > unix.txt
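
The same line-ending conversion can be sketched in plain shell with `tr`. This is an illustration, not a replacement for dos2unix: the helper name `strip_cr` and the sample file names are made up, and `tr -d '\r'` removes every carriage return, not only those directly before a line feed. It works byte by byte, so it does not care what character encoding the file uses.

```shell
#!/bin/sh
# Remove every carriage return (0x0D) from INPUT and write the result to OUTPUT.
# Usage: strip_cr INPUT OUTPUT   (OUTPUT must be a different file than INPUT)
strip_cr() {
    LC_ALL=C tr -d '\r' < "$1" > "$2"
}

workdir=$(mktemp -d)
printf 'first line\r\nsecond line\r\n' > "$workdir/msdos.txt"

strip_cr "$workdir/msdos.txt" "$workdir/unix.txt"

# Inspect the result; no \r should remain in the dump.
od -c "$workdir/unix.txt"
```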

Ramurd 06-08-2017 11:17 AM

Another issue with many "DOS" files is that the encoding they use may differ from the encoding you use.
You can try to find the encoding with the command 'file'
Code:

file /my/file.txt
may get this result:
Quote:

/my/file.txt: ISO-8859 text, with very long lines, with CRLF line terminators
For example, if you use UTF-8 you can convert it with iconv:
Code:

iconv -f ISO-8859-15 -t UTF-8 -o /my/file.utf8.txt /my/file.txt

andrew.comly 06-09-2017 12:42 AM

no success
 
Quote:

Originally Posted by norobro (Post 5720180)
  1. Try the -f option of dos2unix https://waterlan.home.xs4all.nl/dos2....htm#f---force
  2. Load the file into vim then do:
    Code:

    :set ff=unix
    :w


Thanks. Below are my results with the above advice:
Code:

$ ll 247*.txt
-rw-rw-r-- 1 a a 1.5M  3月 22  2012 247-0.txt
$ dos2unix -f 247-0.txt
dos2unix: converting file 247-0.txt to Unix format ...
$ vim 247-0.txt
:set ff
{fileformat=unix              1,1          Top}
:wq
$ ll 247*.txt
-rw-rw-r-- 1 a a 1.5M  6月  9 13:21 247-0.txt

But the dictionary file still has non-ASCII characters in it, especially in the headwords, e.g.
  1. Ab¶botÏship (?), n. [Abbot + Ïship.] The state or office of an abbot.
  2. AbÏbre¶viÏate (?), v.t. [imp. & p.p. Abbreviated (?); p.pr. & vb.n. Abbreviating.] [L. abbreviatus, p.p. of abbreviare; ad + breviare to shorten, fr. brevis short. See Abridge.] 1. To make briefer; to shorten; to abridge; to reduce by contraction or omission, especially of words written or spoken.
    It is one thing to abbreviate by contracting, another by cutting off.
    Bacon.
    2. (Math.) To reduce to lower terms, as a fraction.
    AbÏbre¶viÏate (?), a. [L. abbreviatus, p.p.] 1. Abbreviated; abridged; shortened. [R.] ½The abbreviate form.¸
    Earle.
    2. (Biol.) Having one part relatively shorter than another or than the ordinary type.
  3. AbÏbre¶viÏate, n. An abridgment. [Obs.]
    Elyot.
  4. AbÏbre¶viÏa·ted (?), a. Shortened; relatively short; abbreviate.
  5. AbÏbre·viÏa¶tion (?), n. [LL. abbreviatio: cf. F. abbr‚viation.] 1. The act of shortening, or reducing.
    2. The result of abbreviating; an abridgment.
    ...
  6. AbÏbre¶viÏa·tor (?), n. [LL.: cf. F. abbr‚viateur.] 1. One who abbreviates or shortens.
    2. One of a college of seventyÐtwo officers of the papal court whose duty is to make a short minute of a decision on a petition, or reply of the pope to a letter, and afterwards expand the minute into official form.

andrew.comly 06-09-2017 12:52 AM

bash
 
Quote:

Originally Posted by rknichols (Post 5720187)
You need to run that with "bash", not "sh". When invoked with the name "sh", bash tries to mimic historical versions of sh that do not support "[[ ... ]]".

Thanks a lot. Now when I run the original version of this program with bash, there is no more
Code:

[[: not found
error message.

How dependable is "sh" for testing backward compatibility with older machines?

rknichols 06-09-2017 08:50 AM

Quote:

Originally Posted by andrew.comly (Post 5720822)
How dependable is "sh" for testing backward compatibility with older machines?

Going back through history, there have been a lot of programs that have been called "sh". There is no way a simple switch could make bash behave exactly like all of them.

andrew.comly 06-09-2017 06:41 PM

Attempt at Laserbeak's solution
 
Quote:

Originally Posted by Laserbeak (Post 5720261)
Usually to convert a MS-DOS text file to a UNIX text file, you'd just have to do something like this with it:
Code:

#!/usr/bin/perl
$/ = '';
$_ = <>;
s/\r\n/\n/gs;
print;

./abovecode < msdos.txt > unix.txt

Laserbeak,

Below is my attempt at your proposal:
Code:

$ vim convert_msdos-UNIX.sh
#!/usr/bin/perl
$/ = '';
$_ = <>;
s/\r\n/\n/gs;
print;

{:wq}

Code:

$ ll convert_msdos-UNIX.sh
-rw-rw-r-- 1 a a 55  6月  9 15:37 convert_msdos-UNIX.sh
$ chmod +x convert_msdos-UNIX.sh
$ bash convert_msdos-UNIX.sh < 247-0.txt > 247-0_unix.txt
convert_msdos-UNIX.sh: line 2: $/: No such file or directory
convert_msdos-UNIX.sh: line 3: syntax error near unexpected token `;'
convert_msdos-UNIX.sh: line 3: `$_ = <>;'

{No Success Yet}

andrew.comly 06-09-2017 06:48 PM

Ramurd's solution - Attempt
 
Quote:

Originally Posted by Ramurd (Post 5720548)
You can try to find the encoding with the command 'file'
Code:

file /my/file.txt
may get this result:


For example, if you use UTF-8 you can convert it with iconv:
Code:

iconv -f ISO-8859-15 -t UTF-8 -o /my/file.utf8.txt /my/file.txt

_____________________________________
Ramurd,

Below is my attempt at your proposal:
Code:

$ file 247-0.txt
247-0.txt: data

The type 'data' is not what your proposal calls for, but reasoning from the syntax, I substituted 'data' for 'ISO-8859':
Code:

$ iconv -f data -t UTF-8 -o ./247-0.txt ./247-0-UTF8.txt
iconv: conversion from `data' is not supported
Try `iconv --help' or `iconv --usage' for more information.

{No Success Yet}

Any ideas what to do for format type 'data'?
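
A hedged sketch of how one might probe an unknown encoding: `iconv -f` expects a charset name (see `iconv -l`), and 'data' is only the label that `file` prints when it cannot classify the content. Note also that the file named after `-o` is the output, so the original belongs last on the command line. The sample below is hypothetical. It assumes that the "‚" and "”" seen in "abbr‚viation" and "Zo”l." are the bytes 0x82 and 0x94, which would be é and ö in the DOS code page CP437; that is an inference from the display, not something confirmed in this thread, and the pronunciation marks in the headwords may well be markup that no charset conversion will fix.

```shell
#!/bin/sh
# Hypothetical sample line: octal 202 (0x82) and 224 (0x94) are the
# CP437 codes for e-acute and o-umlaut.
sample=$(mktemp)
printf 'abbr\202viation (Zo\224l.)\r\n' > "$sample"

# Try candidate charsets and eyeball the result.
# The input file is only read; output goes to the terminal.
for charset in CP437 ISO-8859-1; do
    printf '%s: ' "$charset"
    iconv -f "$charset" -t UTF-8 "$sample" | tr -d '\r'
done

# Keep the CP437 decoding for later use.
decoded=$(iconv -f CP437 -t UTF-8 "$sample" | tr -d '\r')
echo "$decoded"
```

Whichever candidate produces readable accented letters on a few known words is the one worth applying to the whole file, written to a new output file.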

Laserbeak 06-09-2017 08:34 PM

Quote:

Originally Posted by andrew.comly (Post 5721116)
Laserbeak,

Below is my attempt at your proposal:
Code:

$ vim convert_msdos-UNIX.sh
#!/usr/bin/perl
$/ = '';
$_ = <>;
s/\r\n/\n/gs;
print;

{:wq}

Code:

$ ll convert_msdos-UNIX.sh
-rw-rw-r-- 1 a a 55  6月  9 15:37 convert_msdos-UNIX.sh
$ chmod +x convert_msdos-UNIX.sh
$ bash convert_msdos-UNIX.sh < 247-0.txt > 247-0_unix.txt
convert_msdos-UNIX.sh: line 2: $/: No such file or directory
convert_msdos-UNIX.sh: line 3: syntax error near unexpected token `;'
convert_msdos-UNIX.sh: line 3: `$_ = <>;'

{No Success Yet}

It's a Perl program, not a bash or sh program.

NevemTeve 06-10-2017 07:07 AM

@OP: Sorry, I've lost track somewhere. What is the actual question? If it is related to a file, examine it with a hex viewer, e.g.:

Code:

echo 'árvíztűrő tükörfúrógép' >sample
adcr sample
od -tx1 sample
0000000 e1 72 76 ed 7a 74 fb 72 f5 20 74 fc 6b f6 72 66
0000020 fa 72 f3 67 e9 70 0d 0a
iconv -f iso-8859-2 -t utf-8 sample >sample_u
od -tx1 sample_u

0000000 c3 a1 72 76 c3 ad 7a 74 c5 b1 72 c5 91 20 74 c3
0000020 bc 6b c3 b6 72 66 c3 ba 72 c3 b3 67 c3 a9 70 0d
0000040 0a
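
The example above depends on the terminal's encoding and on a local `adcr` helper. A reproducible sketch of the same idea, under the assumption that the 24 bytes in the first dump are the intended input, writes those bytes with `printf` octal escapes and then converts both the encoding and the line ending. Unlike the listing above, it also strips the CR; the variable names are made up for illustration.

```shell
#!/bin/sh
sample=$(mktemp)
converted=$(mktemp)

# The ISO-8859-2 bytes from the first dump, ending in CR LF (0d 0a).
printf '\341rv\355zt\373r\365 t\374k\366rf\372r\363g\351p\r\n' > "$sample"
od -An -tx1 "$sample"

# Step 1: decode ISO-8859-2 to UTF-8. Step 2: drop the carriage return.
iconv -f ISO-8859-2 -t UTF-8 "$sample" | tr -d '\r' > "$converted"
od -An -tx1 "$converted"
```

Comparing the two hex dumps shows which bytes the encoding conversion changed and that the 0d before the final 0a is gone.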


wpeckham 06-10-2017 07:27 AM

Before using a hex viewer, see if the magic number gives you information on the file. The command is
Code:

file /usr/share/dict/webster1913-dict.txt
and the output should tell you something.

The issue here is that dos2unix will do the conversion, but it assumes that the file IS in DOS text format. This file appears to have an encoding that is not the simple text that these utilities assume. You will need to find out WHAT it is to determine what conversions or mappings can be done.

rknichols 06-10-2017 08:45 AM

Quote:

Originally Posted by wpeckham (Post 5721262)
Before using a hex viewer, see if the magic number gives you information on the file. The command is
Code:

file /usr/share/dict/webster1913-dict.txt
and the output should tell you something.

Already done in #10, with the result: "247-0.txt: data".

It should be no great surprise that an old dictionary contains non-ASCII characters showing how words are pronounced.

wpeckham 06-10-2017 03:09 PM

Quote:

Originally Posted by rknichols (Post 5721283)
Already done in #10, with the result: "247-0.txt: data".

Well there you go. Tools for properly reformatting text files are going to have undefined behavior if you use them on data files. The difference between DOS and UNIX format text is not the issue because this is not text.

The question then becomes "can the apps you are using properly use a dictionary file of this particular data format?" and if the answer is "no" then you have a more interesting problem. You may need to change apps to one that can use this dictionary, find a dictionary for your app, or find a converter SPECIFIC to the dictionary formats to do the conversion.


All times are GMT -5.