LinuxQuestions.org
Linux - General This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.

Old 12-28-2022, 11:09 AM   #1
lucmove
Senior Member
 
Registered: Aug 2005
Location: Brazil
Distribution: Debian
Posts: 1,435

Three questions about file system encoding


1) I have an old hard disk with file names that have accented characters. For example á é ã ô. When I browse that old disk, those characters are truncated.

I thought they should have been encoded in iso-8859-1 because I was a big fan of iso-8859-1 at the time, so I tried to convert them like this:

$ convmv -f iso-8859-1 -t utf8 ./

It didn't work. Neither did cp1252.

So I ran a loop over the output of `convmv --list' and did a dry run with each one of the possible encodings as the -f (from) option. None of them worked.

Question number 1: is there some way to convert and correct those truncated characters in file names?

2) Then I used rsync to copy some very new files to the old disk, and I noticed that many (not all) file names were written enclosed in 'hard quotes' and not all of those have accented characters.

Question number 2: Why are those hard quotes being added to the file names? Is it because the old hard disk is formatted with a different encoding?

3) Question number 3: Does "file system encoding" even exist? I thought the OS had its encoding enforced at run time, but file systems were agnostic.

TIA
 
Old 12-28-2022, 02:09 PM   #2
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,468

I would say this is less about encoding, and more about locales. Presuming the Portuguese of your location, I would suggest you stay away from iso-8859-1 & cp-1252, which are distinctly unimaginative US codepages, where you don't get accents.

Get information. Look at 'man locale' and get some lists up. There's no problem setting up a Portuguese user with a different locale for that stuff. You need the correct locale to see the correct names.

Generally, you want to get on Unicode. I have hacked the Irish keymap to get characters not on my keymap. There are some ingenious character definitions like 'dead_abovedot', because old Irish uses an abovedot. I have that mapped to AltGr & h. I press that, and nothing is printed; but the next letter I type has the abovedot, e.g. ḃ.

Last edited by business_kid; 12-28-2022 at 02:13 PM.
 
1 members found this post helpful.
Old 12-28-2022, 04:21 PM   #3
lucmove
Senior Member
 
Registered: Aug 2005
Location: Brazil
Distribution: Debian
Posts: 1,435

Original Poster
OK, thanks, but UTF-8 has been my locale for more than 10 years and everything works fine... usually. This very old backup hard disk is an exception.

I also know how to map weird characters. I have a bunch mapped to my right-most Windows key for Russian. I was interested in learning Russian a couple of years ago, but gave up on that very fast. Anyway, my real problem now is recursively fixing a large number of broken file names.
 
Old 12-29-2022, 12:04 AM   #4
lvm_
Member
 
Registered: Jul 2020
Posts: 991

what do you mean by 'truncated'? Are they missing? Garbled? Replaced with non-accented characters? Can you show an example of ls and ls|hexdump? In general, there is no such thing as 'filesystem encoding', filenames are interpreted according to the current session-level locale, so you have to guess the correct one and set LANG/LC_* variables accordingly.

Re Q2: ls always adds single quotes to filenames with delimiters which otherwise will be split into multiple parts in subsequent processing. Use 'ls -N' to list literal file names.
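
As a rough illustration of what a byte dump would reveal, here is a minimal sketch. The helper name hexbytes and the shortened sample names are made up for the example; the point is only that Á is stored as two bytes (c3 81) in UTF-8 but as a single byte (c1) in ISO-8859-1, so the dump shows which encoding was used when the name was written.

```shell
# Hypothetical helper: print the raw bytes of a string as space-separated hex.
hexbytes() {
    printf '%s' "$1" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# The same three letters "GRÁF" written in two different encodings.
utf8_name=$(printf 'GR\303\201F')    # Á as the two bytes c3 81 (UTF-8)
latin1_name=$(printf 'GR\301F')      # Á as the single byte c1 (ISO-8859-1)

echo "UTF-8 bytes:      $(hexbytes "$utf8_name")"
echo "ISO-8859-1 bytes: $(hexbytes "$latin1_name")"
```

On a real directory, something like ls | od -c or ls | hexdump -C gives the same kind of information for the actual names.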
 
Old 12-29-2022, 04:04 AM   #5
lucmove
Senior Member
 
Registered: Aug 2005
Location: Brazil
Distribution: Debian
Posts: 1,435

Original Poster
Quote:
Originally Posted by lvm_
what do you mean by 'truncated'? Are they missing? Garbled? Replaced with non-accented characters? Can you show an example of ls and ls|hexdump? In general, there is no such thing as 'filesystem encoding', filenames are interpreted according to the current session-level locale, so you have to guess the correct one and set LANG/LC_* variables accordingly.

Re Q2: ls always adds single quotes to filenames with delimiters which otherwise will be split into multiple parts in subsequent processing. Use 'ls -N' to list literal file names.
Spacefm file manager shows it like this:

https://0x0.st/oRoX.png

Same for pcmanfm:

https://0x0.st/oRo8.png

On 'ls' output, it shows as '2009 ACORDO ORTOGR?FICO 2009 - tabela.doc'

The correct form would be '2009 ACORDO ORTOGRÁFICO 2009 - tabela.doc'

I suspected there is no such thing as filesystem encoding. Now I see what happened is that I used a different locale at the time the file was written. Most likely iso-8859-1.
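
One way to test that guess on a single name before touching anything is sketched below. It assumes iconv is installed; the helper names is_utf8 and latin1_to_utf8 and the shortened sample name are invented for the example.

```shell
# Hypothetical helpers; iconv is assumed to be available.

# Succeeds only if the argument is a valid UTF-8 byte sequence.
is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Reinterprets the argument's bytes as ISO-8859-1 and prints them as UTF-8.
latin1_to_utf8() {
    printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8
}

# Shortened sample name with Á stored as the single byte 0xC1.
old_name=$(printf 'ORTOGR\301FICO.doc')

if is_utf8 "$old_name"; then
    echo "already valid UTF-8"
else
    echo "not valid UTF-8; read as ISO-8859-1 it becomes: $(latin1_to_utf8 "$old_name")"
fi
```

Note that every byte sequence is valid ISO-8859-1, so the conversion never fails by itself. The only real check is whether the converted name looks right on a UTF-8 terminal.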
 
Old 12-29-2022, 04:43 AM   #6
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,468

Quote:
Originally Posted by lucmove
Anyway, my real problem now is recursively fixing a large number of broken file names.
Glad you know your way around. Are you sure you're not just viewing them in the wrong codepage, Locale or Whatever? If the files are named wrongly, why not rename them? Are you sure the disk is OK?

While I think of it, Codepage 437 used to be an old chestnut from DOS. It was one of the standard lines in autoexec.bat (if memory serves correctly). That might have varied in other places. I didn't see CP-1252 until Windows 95(?).
 
Old 12-29-2022, 05:29 AM   #7
lucmove
Senior Member
 
Registered: Aug 2005
Location: Brazil
Distribution: Debian
Posts: 1,435

Original Poster
Quote:
Originally Posted by business_kid
Glad you know your way around. Are you sure you're not just viewing them in the wrong codepage, Locale or Whatever? If the files are named wrongly, why not rename them? Are you sure the disk is OK?

While I think of it, Codepage 437 used to be an old chestnut from DOS. It was one of the standard lines in autoexec.bat (if memory serves correctly). That might have varied in other places. I didn't see CP-1252 until Windows 95(?).
There are too many of them in directories and subdirectories. Doing that manually would be a good punishment sentence.

There is no DOS or Windows involved. It's all Linux, ext3. I believe I used Ubuntu at the time, but could have been Slackware.
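
For the recursive part, a minimal sketch of a rename loop follows. It is not a tested tool: the function name fix_names and the APPLY switch are invented here, it assumes the old names really are ISO-8859-1, it relies on find's -mindepth and -depth options and on mv -n, and it will mishandle names that contain newlines. It only prints what it would do unless APPLY=1 is set.

```shell
# Hypothetical sketch: rename every entry under "$1" whose name is not valid
# UTF-8, treating the old bytes as ISO-8859-1. Dry run unless APPLY=1.
fix_names() {
    # -depth lists children before their parent directory, so a directory is
    # only renamed after everything inside it has been handled.
    find "$1" -mindepth 1 -depth | while IFS= read -r path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        # Skip names that are already valid UTF-8 (this includes plain ASCII).
        if printf '%s' "$base" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            continue
        fi
        new=$(printf '%s' "$base" | iconv -f ISO-8859-1 -t UTF-8) || continue
        if [ "${APPLY:-0}" = 1 ]; then
            # -n refuses to overwrite an existing file of the same name.
            mv -n -- "$path" "$dir/$new"
        else
            printf 'would rename: %s -> %s\n' "$path" "$dir/$new"
        fi
    done
}
```

Run it first as fix_names /path/to/old/disk and read the output, then as APPLY=1 fix_names /path/to/old/disk once the proposed names look right. convmv does the same job with -r for recursion and --notest to actually rename, since without --notest it only reports what it would do.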
 
Old 12-29-2022, 10:20 AM   #8
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,468

One thing you could do is look at the file names with a hex editor. Standard ASCII runs from 0 to 127 (printable characters are 32 to 126). Bytes from 128 to 255 mean the name uses UTF-8 or some 8-bit code page.
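
A quick way to find which names are affected, without opening each one in a hex editor, might look like the sketch below. The function name non_ascii_names is made up; it relies on the C locale so that the bracket range from space to tilde covers exactly the printable ASCII bytes, and on find supporting -mindepth.

```shell
# Hypothetical helper: list every entry under "$1" whose name contains at
# least one byte outside printable ASCII (space through tilde).
non_ascii_names() {
    LC_ALL=C find "$1" -mindepth 1 -name '*[! -~]*'
}
```

Calling non_ascii_names /path/to/old/disk lists both the broken names and any names that are already correct UTF-8, since both contain bytes above 127. It narrows the search but does not say which encoding a name is in.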
 
  


All times are GMT -5. The time now is 12:06 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration