LinuxQuestions.org
Old 03-26-2023, 05:59 AM   #1
blumenwesen
Member
 
Registered: Jan 2022
Posts: 40

Rep: Reputation: 0
filtering out displayable characters in bash


How can I omit all unreadable or incorrectly displayed characters?

GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.1)

Code:
$ a="֎"
$ printf %x "\""$a
58e(base) # correct

$ echo "֎" | nawk '{print | eval "printf \"%x\" \"\\\""$0"\""}'
d6(base) # wrong

$ echo -e "֎" | awk '{ cmd="printf %x \\\"" $0; cmd | getline hex; close(cmd); print hex }'
d6
(base) # wrong

$ echo -e "֎" | awk '{ system("/usr/bin/printf \"%x\" \\\""$0"") }'
/usr/bin/printf: warning: �: character(s) following character constant have been ignored
d6(base) # wrong


# complete script
$ echo -e "\n\n\n"  | awk '{ split($0, z, ""); for(y=0; y<length(z); y++){ if(system("/usr/bin/fc-list :charset=$(/usr/bin/printf \"%x\" \\\"\""z[y]"\")")){ print z[y] } } }'
/usr/bin/printf: '"': expected a numeric value
/usr/share/fonts/truetype/lyx/wasy10.ttf: wasy10:style=LyX
/usr/share/fonts/truetype/lyx/stmary10.ttf: stmary10:style=LyX
/usr/bin/printf: warning: ��: character(s) following character constant have been ignored
/usr/share/fonts/truetype/lato/Lato-Medium.ttf: Lato,Lato Medium:style=Medium,Regular
# ...  wrong

# How can I get the same result as this?
$ for z in $(echo -e "\n\n\n"); do
$ [[ $(fc-list :charset=$(printf %x "\"$z")) ]] && echo "$z"
$ done

 # correct
Thanks for the help.

Last edited by blumenwesen; 03-26-2023 at 06:03 AM.
 
Old 03-26-2023, 08:13 AM   #2
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,452

Rep: Reputation: 1061
Perhaps you can use the [:print:] character class?

How to delete characters that are not in [:print:]

Bash:
Code:
string='text with bad characters'
newstring=${string//[^[:print:]]/}
stdin to stdout with tr:
Code:
tr -dc '[:print:]'
awk stdin:
Code:
{ gsub(/[^[:print:]]/,"") }
You might need to set LC_ALL=C (LC_CTYPE at least) to enforce an ASCII character set.
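
A minimal sketch of the LC_ALL=C point, assuming a POSIX `tr`: in the C locale every byte above 0x7F counts as non-printable, so multi-byte UTF-8 characters are dropped together with control characters. The function name `strip_nonprintable` is made up for this illustration, and `\n` is added to the kept set so that line breaks survive.

```shell
# Hypothetical helper: keep only printable ASCII (plus newlines) from stdin.
strip_nonprintable() {
  LC_ALL=C tr -dc '[:print:]\n'
}

# Input: a tab, the two UTF-8 bytes of U+058E (octal 326 216), and plain text.
printf 'a\tb\326\216c\n' | strip_nonprintable
```

Note that this removes every non-ASCII character, including ones that a font could display, so it only fits cases where plain ASCII output is acceptable.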

Last edited by MadeInGermany; 03-27-2023 at 05:13 AM.
 
Old 03-26-2023, 08:32 AM   #3
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,479
Blog Entries: 1

Rep: Reputation: 1695
Perhaps this:
Code:
echo '֎Szűrő' | iconv -f UTF-8 -t ASCII//IGNORE 2>/dev/null
Szr
Or perhaps:
Code:
echo '֎Szűrő' | iconv -f UTF-8 -t ISO-8859-2//IGNORE 2>/dev/null | iconv -f ISO-8859-2 -t UTF-8
Szűrő

Last edited by NevemTeve; 03-26-2023 at 08:35 AM.
 
Old 03-26-2023, 09:38 AM   #4
blumenwesen
Member
 
Registered: Jan 2022
Posts: 40

Original Poster
Rep: Reputation: 0
The first one is displayed normally as a man in a circle, and the second one as a heart with sinus rhythm, so both should appear in the output.
The third is a bordered box with F21F and the fourth a bordered box with F220; neither should be displayed.


Code:
$ a=""; [[ $(fc-list :charset=$(printf %x "\"$a")) ]] && echo "$a"
# result: 

$ a=""; [[ $(fc-list :charset=$(printf %x "\"$a")) ]] && echo "$a"
# result: 

$ a=""; [[ $(fc-list :charset=$(printf %x "\"$a")) ]] && echo "$a"
# result: nothing

$ a=""; [[ $(fc-list :charset=$(printf %x "\"$a")) ]] && echo "$a"
# result: nothing

string=''
echo ${string//[^[:print:]]/} # displays all

echo -e "" | iconv -f UTF-8 -t ISO-8859-2//IGNORE 2>/dev/null | iconv -f ISO-8859-2 -t UTF-8 # displays nothing
 
Old 03-26-2023, 12:18 PM   #5
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,479
Blog Entries: 1

Rep: Reputation: 1695
What do you actually wish to accomplish? Check a character's presence in a font-file?
 
Old 03-26-2023, 01:30 PM   #6
blumenwesen
Member
 
Registered: Jan 2022
Posts: 40

Original Poster
Rep: Reputation: 0
Check a file for characters that cannot be represented symbolically, i.e. filter out the boxes surrounded by hexadecimal numbers.
I wanted to use awk because the lines for the start and end are predefined, and further adjustments are needed.
 
Old 03-26-2023, 11:24 PM   #7
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,479
Blog Entries: 1

Rep: Reputation: 1695
That depends on the font file in use, I guess.
 
Old 03-27-2023, 01:13 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 20,224

Rep: Reputation: 6834
Yes, probably the tool (fc-list) uses a different locale that is not "compatible" with the actual font. You cannot handle that [easily] in awk.
 
Old 03-27-2023, 07:50 AM   #9
blumenwesen
Member
 
Registered: Jan 2022
Posts: 40

Original Poster
Rep: Reputation: 0
OK, then I'll have to try without it. Thanks anyway. =)
 
Old 03-27-2023, 08:39 AM   #10
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,452

Rep: Reputation: 1061
The multi-byte conversion in
Code:
$ printf "%x\n" \"$a
58e
seems to be special to the bash-builtin printf.
The external printf command fails:
Code:
$ /usr/bin/printf "%x\n" \"$a
/usr/bin/printf: warning: Ž: character(s) following character constant have been ignored
d6
And another bash-builtin fails:
Code:
$ echo \"$a
"֎
Avoid system() in awk!
In bash you can try an explicit while loop
Code:
echo -e "\n\n\n" |
while IFS= read -r z
do
  [[ $(fc-list :charset=$(printf "%x" "\"$z")) ]] &&
  echo "$z"
done
The pipe forces the loop into a subshell. If you want it to run in the main shell (so that variables set inside the loop remain available afterwards), use a process substitution:
Code:
while IFS= read -r z
do
  [[ $(fc-list :charset=$(printf "%x" "\"$z")) ]] &&
  echo "$z"
done < <( echo -e "\n\n\n" )
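
The difference between the two loop forms can be seen with a counter. This is only a sketch for bash with default options; the variable names `count_piped` and `count_procsub` are made up for the illustration.

```shell
# bash-specific: uses process substitution.

count_piped=0
printf 'a\nb\nc\n' |
while IFS= read -r line
do
  count_piped=$((count_piped + 1))     # incremented in a subshell only
done

count_procsub=0
while IFS= read -r line
do
  count_procsub=$((count_procsub + 1)) # incremented in the main shell
done < <(printf 'a\nb\nc\n')

echo "piped: $count_piped, process substitution: $count_procsub"
```

In bash with default options the piped counter stays at 0 after the loop, while the process-substitution counter reaches 3.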
read reads one newline-terminated line at a time. Since bash has no split function, you might need another loop to iterate over the individual characters.
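
A sketch of such an inner loop, using bash substring expansion. The function name `each_char` is hypothetical. Splitting into whole characters rather than bytes assumes that bash runs in a UTF-8 locale; the example input is plain ASCII to stay independent of that.

```shell
# bash-specific: uses substring expansion and an arithmetic for loop.
# Hypothetical helper: print each character of every input line on its own line.
each_char() {
  local line i
  while IFS= read -r line
  do
    for (( i = 0; i < ${#line}; i++ ))
    do
      printf '%s\n' "${line:i:1}"   # one character starting at offset i
    done
  done
}

printf 'ab\ncd\n' | each_char
```

Inside the inner loop, the `printf '%s\n'` line could be replaced by the fc-list test from the loops above, applied to `"${line:i:1}"` instead of `"$z"`.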

Last edited by MadeInGermany; 03-27-2023 at 09:10 AM.
 
Old 03-27-2023, 08:50 AM   #11
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,198

Rep: Reputation: 2274
Quote:
Originally Posted by blumenwesen
Check a file for characters that cannot be represented symbolically, i.e. filter out the boxes surrounded by hexadecimal numbers.
Awk does not know how your font will choose to render the characters (i.e. bytes / hex numbers) that it will receive.

The characters you've used are F21D, F21E, F21F, F220 - all four of those are within the E000–F8FF Unicode Private Use Area which seems to mean there is no standardized symbol for any of them, and so you need to determine what defines the first two as valid characters in your context.
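
The range check can be sketched with shell arithmetic on the hex value that printf reports. The function name `is_private_use` is made up here. The two supplementary ranges (U+F0000 to U+FFFFD and U+100000 to U+10FFFD) are not mentioned in this thread; they are the other Private Use Areas defined by Unicode.

```shell
# Hypothetical helper: succeed if the hex code point in $1 is a private-use one.
is_private_use() {
  cp=$(( 0x$1 ))
  { [ "$cp" -ge $(( 0xE000 )) ]   && [ "$cp" -le $(( 0xF8FF )) ]; } ||
  { [ "$cp" -ge $(( 0xF0000 )) ]  && [ "$cp" -le $(( 0xFFFFD )) ]; } ||
  { [ "$cp" -ge $(( 0x100000 )) ] && [ "$cp" -le $(( 0x10FFFD )) ]; }
}

for hex in 58e f21d f220
do
  if is_private_use "$hex"
  then echo "U+$hex: private use"
  else echo "U+$hex: not private use"
  fi
done
```

This only says that a code point has no standardized meaning. Whether a glyph exists for it still depends on the installed fonts.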


After that, well, it depends on whether and how well Awk deals with multi-byte characters. Since Gawk only has single-byte escape sequences, you may be better off using (e.g.) Perl or Python, which have explicit support.
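
One way to sidestep the multi-byte question in awk is to run it in the C locale and decode UTF-8 by hand. The following is only a sketch under some assumptions: the input is valid UTF-8, and the awk in use splits a string into single bytes when `split` gets an empty separator (gawk and mawk do, but POSIX leaves this undefined). The function name `utf8_codepoints` is hypothetical, and no validation of malformed input is attempted.

```shell
# Hypothetical helper: print the hex code point of each UTF-8 character on stdin.
utf8_codepoints() {
  LC_ALL=C awk '
    BEGIN { for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }  # byte -> number
    {
      n = split($0, b, "")            # in the C locale: one element per byte
      i = 1
      while (i <= n) {
        c = ord[b[i]]
        if      (c >= 240) { cp = c - 240; len = 4 }   # lead byte 11110xxx
        else if (c >= 224) { cp = c - 224; len = 3 }   # lead byte 1110xxxx
        else if (c >= 192) { cp = c - 192; len = 2 }   # lead byte 110xxxxx
        else               { cp = c;       len = 1 }   # ASCII or stray byte
        for (k = 1; k < len && i + k <= n; k++)
          cp = cp * 64 + (ord[b[i + k]] - 128)         # append 6 payload bits
        printf "%x\n", cp
        i += len
      }
    }'
}

# The two UTF-8 bytes of U+058E, written as octal escapes.
printf '\326\216\n' | utf8_codepoints
```

For the two bytes D6 8E this gives 58e, the same value that the bash builtin printf reports for ֎ in the first post.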

 
Old 03-27-2023, 02:08 PM   #12
blumenwesen
Member
 
Registered: Jan 2022
Posts: 40

Original Poster
Rep: Reputation: 0
It seems this can only work in awk if there is an alternative to printf's hex conversion inside awk, but I don't know of one.

For now I'm going to do it like this; I just thought awk might be faster.

Code:
FILE=$(echo -e "\n\n\n\n\n\n\n" | tail -n +3 | head -n -2)
for z in $FILE; do
[[ $(fc-list :charset=$(printf %x "\"$z")) ]] && echo "$z"
done
# result:



Last edited by blumenwesen; 03-27-2023 at 02:09 PM.
 
  


All times are GMT -5. The time now is 03:30 PM.
