Bash: Trouble converting files from dos to unix format
I first tried this program, but it failed with the following error:
========================================================================
Problem 1: [[: not found
========================================================================
Code:
ll /usr/share/dict/webster1913-dict.txt
-rw-rw-r-- 1 a a 1.5M 3月 23 2012 /usr/share/dict/webster1913-dict.txt
sh Ch16-19DictLookupDef2.sh Abbey
Ch16-19DictLookupDef2.sh: 20: Ch16-19DictLookupDef2.sh: [[: not found
1st parameter detected!
Workaround 1
I quickly fixed this specific error by
- changing all double brackets into single brackets
- changing the "Definition" assignment
FROM
Code:
Definition=$(fgrep -A $MAXCONTEXTLINES "$1 \\" "$dictfile")
TO
Code:
Definition=$(fgrep -A $MAXCONTEXTLINES "$1" "$dictfile")
{SOLVED}
========================================================================
Problem 2: dictionary - non-ASCII characters
========================================================================
Before running the program again, I checked /usr/share/dict/webster1913-dict.txt:
Code:
ll /usr/share/dict/webster1913-dict.txt
-rw-rw-r-- 1 a a 1.5M 3月 23 2012 /usr/share/dict/webster1913-dict.txt
I then ran cat on it and looked at some of its content. Strangely, it does not consist solely of ASCII characters. Examples taken from the first 66 lines:
eÐ
½
ØAb¶aÏca (?),
AÏbac¶iÏnate
Workaround 2
I then fixed the content of about 4 words, along with their definitions, by hand, and ran the program.
Code:
$ /usr/local/bin/practice/ABS/Ch16-19DictLookupDef3.sh Ape
1st parameter detected!
Ape (?), n. [AS. apa; akin to D. aap, OHG. affo, G. affe, Icel. api, Sw. apa, Dan. abe, W. epa.] 1. (Zo”l.) A quadrumanous mammal, esp. of the family Simiad‘, having teeth of the same number and form as in man, having teeth of the same number and form as in man, and possessing neither a tail nor cheek pouches. The name is applied esp. to species of the genus Hylobates, and is sometimes used as a general term for all Quadrumana. The higher forms, the gorilla, chimpanzee, and ourang, are often called anthropoid apes or man apes.
The ape of the Old Testament was probably the rhesus monkey of India, and allied forms.
PREMISES
I guess I had to make these changes because of the comment on lines 9-10 of the script: "Convert it from DOS to UNIX format (with only LF at end of line) before using it with this script."
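For reference, a rough sketch of what that script comment asks for when the file really is plain DOS text: count the lines that end in CR LF, then strip the CR bytes. The helper names count_crlf_lines and strip_cr and the sample words are made up for illustration, and this only handles line endings, not character encoding.

```shell
# Hypothetical helpers; the sample data below is invented for illustration.
count_crlf_lines() {
  # Count lines whose last character before the LF is a carriage return.
  cr=$(printf '\r')
  grep -c "${cr}\$" "$1" || true   # grep exits 1 when the count is 0
}

strip_cr() {
  # Remove every CR byte; in a DOS text file these occur only in CRLF pairs.
  tr -d '\r' < "$1" > "$2"
}

sample=$(mktemp)
unix_copy=$(mktemp)
printf 'Abbey\r\nAbbot\r\n' > "$sample"

count_crlf_lines "$sample"       # prints 2
strip_cr "$sample" "$unix_copy"
count_crlf_lines "$unix_copy"    # prints 0
```

Writing to a separate output file keeps the original intact in case the input turns out not to be text at all.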
QUESTION
How do I convert it to UNIX format? I tried the dos2unix utility below, but it says that the dictionary file downloaded from Project Gutenberg is a binary file!?
Code:
$ dos2unix webster1913-dict.txt
dos2unix: Binary symbol 0x15 found at line 35
dos2unix: Skipping binary file webster1913-dict.txt
You need to run that with "bash", not "sh". "[[ ... ]]" is a bash extension that plain POSIX sh does not have, and on many systems (Debian and Ubuntu, for example) "sh" is a different shell such as dash, which reports "[[: not found".
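Workaround 1 above swapped the double brackets for single ones by hand. A minimal sketch of that portable form follows; check_param is a hypothetical stand-in for the script's own argument check, not code taken from the script itself.

```shell
#!/bin/sh
# Sketch only: check_param is a hypothetical stand-in for the script's
# argument check. [ ... ] is the POSIX test command and works in dash,
# bash and other sh implementations; [[ ... ]] does not work in dash.
check_param() {
  if [ -n "${1:-}" ]; then
    echo "1st parameter detected!"
  else
    echo "Usage: check_param word" >&2
    return 1
  fi
}

check_param Abbey
```

The alternative is to leave the script untouched and either start it with "bash Ch16-19DictLookupDef2.sh Abbey" or give it a "#!/bin/bash" first line and execute it directly.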
Another issue with many "DOS" files is that the encoding they use may differ from the encoding you use.
You can try to find the encoding with the 'file' command:
Code:
file /my/file.txt
You may get a result like this:
Quote:
/my/file.txt: ISO-8859 text, with very long lines, with CRLF line terminators
For example, if you use UTF-8 you can convert it with iconv:
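A minimal sketch of such a conversion, assuming the source really is ISO-8859-1 text with CRLF line endings. The helper name to_utf8_unix, the file names and the sample words are all made up for illustration, and the encoding must be replaced by whatever 'file' actually reports.

```shell
# to_utf8_unix is a hypothetical helper; ISO-8859-1 is an assumed source
# encoding and must be replaced by whatever the file really uses.
to_utf8_unix() {
  src=$1 dst=$2 enc=${3:-ISO-8859-1}
  # Convert the character encoding first, then drop the CR of each CRLF pair.
  iconv -f "$enc" -t UTF-8 "$src" | tr -d '\r' > "$dst"
}

# Made-up sample: two Latin-1 words with DOS line endings.
workdir=$(mktemp -d)
printf 'caf\351\r\nna\357ve\r\n' > "$workdir/latin1-dos.txt"

to_utf8_unix "$workdir/latin1-dos.txt" "$workdir/utf8-unix.txt"
cat "$workdir/utf8-unix.txt"
```

This only helps when the input is text in an encoding that iconv knows; it cannot guess an encoding for you.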
$ ll 247*.txt
-rw-rw-r-- 1 a a 1.5M 3月 22 2012 247-0.txt
$ dos2unix -f 247-0.txt
dos2unix: converting file 247-0.txt to Unix format ...
$ vim 247-0.txt
:set ff
{fileformat=unix 1,1 Top}
:wq
$ ll 247*.txt
-rw-rw-r-- 1 a a 1.5M 6月 9 13:21 247-0.txt
But the dictionary file still has non-ASCII characters in it, especially in the words, e.g.
Ab¶botÏship (?), n. [Abbot + Ïship.] The state or office of an abbot.
AbÏbre¶viÏate (?), v.t. [imp. & p.p. Abbreviated (?); p.pr. & vb.n. Abbreviating.] [L. abbreviatus, p.p. of abbreviare; ad + breviare to shorten, fr. brevis short. See Abridge.] 1. To make briefer; to shorten; to abridge; to reduce by contraction or omission, especially of words written or spoken.
It is one thing to abbreviate by contracting, another by cutting off.
Bacon.
2. (Math.) To reduce to lower terms, as a fraction.
AbÏbre¶viÏate (?), a. [L. abbreviatus, p.p.] 1. Abbreviated; abridged; shortened. [R.] ½The abbreviate form.¸
Earle.
2. (Biol.) Having one part relatively shorter than another or than the ordinary type.
AbÏbre¶viÏate, n. An abridgment. [Obs.]
Elyot.
AbÏbre¶viÏa·ted (?), a. Shortened; relatively short; abbreviate.
AbÏbre·viÏa¶tion (?), n. [LL. abbreviatio: cf. F. abbr‚viation.] 1. The act of shortening, or reducing.
2. The result of abbreviating; an abridgment.
...
AbÏbre¶viÏa·tor (?), n. [LL.: cf. F. abbr‚viateur.] 1. One who abbreviates or shortens.
2. One of a college of seventyÐtwo officers of the papal court whose duty is to make a short minute of a decision on a petition, or reply of the pope to a letter, and afterwards expand the minute into official form.
Thanks a lot. Now when I run the original version of this program with "bash" instead of "sh", there is no more
Code:
[[: not found
error message.
How dependable is "sh" to test backward compatibility with older machines?
Going back through history, there have been a lot of programs that have been called "sh". There is no way a simple switch could make bash behave exactly like all of them.
The reported type, 'data', is not an encoding that your proposal covers, but going purely by the syntax I substituted 'data' for 'ISO-8859':
Code:
$ iconv -f data -t UTF-8 -o ./247-0.txt ./247-0-UTF8.txt
iconv: conversion from `data' is not supported
Try `iconv --help' or `iconv --usage' for more information.
Before using a hex viewer, see if the magic number gives you information on the file. The command is
Code:
file /usr/share/dict/webster1913-dict.txt
and the output should tell you something.
The issue here is that dos2unix will do the conversion, but it assumes that the file IS DOS-format text. This file appears to use an encoding that is not the simple text these utilities assume. You will need to find out WHAT it is to determine what conversions or mappings can be done.
Already done in #10, with the result: "247-0.txt: data".
Well there you go. Tools for properly reformatting text files are going to have undefined behavior if you use them on data files. The difference between DOS and UNIX format text is not the issue because this is not text.
The question then becomes "can the apps you are using properly use a dictionary file of this particular data format?" and if the answer is "no" then you have a more interesting problem. You may need to change apps to one that can use this dictionary, find a dictionary for your app, or find a converter SPECIFIC to the dictionary formats to do the conversion.
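One rough way to gauge how far a file is from plain text before choosing a converter is to count the bytes that are neither printable ASCII nor tab, LF or CR. This is only a sketch: count_nontext_bytes is a hypothetical helper and the sample line is invented, loosely modelled on the dictionary entries quoted earlier.

```shell
# count_nontext_bytes is a hypothetical helper name.
count_nontext_bytes() {
  # Delete tab (\11), LF (\12), CR (\15) and printable ASCII (\40-\176),
  # then count the bytes that are left over.
  LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | wc -c | tr -d ' '
}

# Invented sample: one high-bit byte (octal 266) and one control byte (0x15).
probe=$(mktemp)
printf 'Ab\266bot (?), n.\025\r\n' > "$probe"

count_nontext_bytes "$probe"   # prints 2
```

A count of zero means the file is pure ASCII. A few high-bit bytes suggest an 8-bit text encoding, while control bytes such as the 0x15 that dos2unix complained about point to something that is not ordinary text.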