utf-8 encode

higorr

Dabbler
Joined
Jan 7, 2020
Messages
12
I need a little help with the character encoding (charset).
In the terminal I can't see my folders whose names contain special characters such as ç, é, á, ã, etc.

I've tried both UTF-8 and ISO-8859-1, but to no avail so far.
Has anyone had this problem and could give me some help? I need to solve this because it prevents my data from syncing with S3.

Name of folder example:

ISO-8859-1
ADMISS�O DE FUNCIONARIO

UTF-8
ADMISS?O DE FUNCIONARIO



environment variables
LANG=pt_PT.ISO8859-1
MM_CHARSET=ISO-8859-1



file : .login_conf
Code:
me:\
    :charset=ISO-8859-1:\
    :lang=pt_PT.ISO8859-1:
 

HolyK

Ninja Turtle
Moderator
Joined
May 26, 2011
Messages
654
Hi, I am not using S3, so I might be completely wrong here, but could you provide the full output of locale (or locale -a)? Maybe one of the other environment variables is wrong or not set.
 

higorr

Hello, thank you for your attention. The information follows.
These are my environment variables; I changed some values because they are sensitive.

Code:
SSH_CONNECTION=198.51.100.216 46814 198.51.100.192 22
LANG=pt_BR.UTF-8
MM_CHARSET=UTF-8
USER=test
PWD=/Arquivos/Infra
HOME=/home/test
SSH_CLIENT=198.51.100.216 46814 22
SSH_TTY=/dev/pts/0
MAIL=/var/mail/test
SHELL=/usr/local/bin/bash
TERM=xterm-256color
SHLVL=1
BLOCKSIZE=K
LOGNAME=test
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:/home/test/bin
_=/usr/bin/env
OLDPWD=/Arquivos


This is the output of the locale command:
Code:
LANG=pt_BR.UTF-8
LC_CTYPE="pt_BR.UTF-8"
LC_COLLATE="pt_BR.UTF-8"
LC_TIME="pt_BR.UTF-8"
LC_NUMERIC="pt_BR.UTF-8"
LC_MONETARY="pt_BR.UTF-8"
LC_MESSAGES="pt_BR.UTF-8"
LC_ALL=
 

HolyK

So you have set .login_conf to pt_PT.ISO8859-1, but LANG still returns pt_BR.UTF-8 even after logging off and back in? If so, I guess you haven't run cap_mkdb to regenerate the binary config file(s). Give it a try, then log in again and check locale again.

EDIT: Once the above is done, could you please run the following command in the directory where you see the wrong names and provide the output:
find . -name "ADMIS*" -type f -print0 | hexdump -C
 

higorr

Oops, sorry, I did some testing and forgot to mention the changes. I'm now using the following settings in the login.conf file:

Code:
me:\
    :charset=UTF-8:\
    :lang=pt_BR.UTF-8:


I have already run the command cap_mkdb ~/.login.conf and logged off.
This is the current output of the locale command:
Code:
LANG=pt_BR.UTF-8
LC_CTYPE="pt_BR.UTF-8"
LC_COLLATE="pt_BR.UTF-8"
LC_TIME="pt_BR.UTF-8"
LC_NUMERIC="pt_BR.UTF-8"
LC_MONETARY="pt_BR.UTF-8"
LC_MESSAGES="pt_BR.UTF-8"
LC_ALL=



And when I tried the command find . -name "ADMIS*" -type f -print0 | hexdump -C in the directory that contains the wrongly named folders, I got no output.
 

HolyK

Ah, sorry, they're directories. In that case the command is find . -name "ADMIS*" -type d -print0 | hexdump -C
 

higorr

OK, this is the command output:

Code:
00000000  2e 2f 41 44 4d 49 53 53  c3 4f 20 44 45 20 46 55  |./ADMISS.O DE FU|
00000010  4e 43 49 4f 4e 41 52 49  4f 00 2e 2f 41 44 4d 49  |NCIONARIO../ADMI|
00000020  53 53 41 4f 20 44 45 20  45 53 54 41 47 49 41 52  |SSAO DE ESTAGIAR|
00000030  49 4f 00                                          |IO.|
00000033
 

HolyK

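Okay, a single c3 is the ISO-8859-1 code for Ã. In UTF-8 it would have to be followed by a continuation byte (c3 83 for Ã), and here it is followed by 4f, so the name looks like it was stored in a single-byte charset rather than UTF-8. The character itself is an ordinary one, which eliminates issues with unsupported characters.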

Now ... do you know under which encoding the directory was originally created? Was it ISO-8859-1 or something else?
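
As a cross-check of the hexdump above, here is a minimal Python sketch (not part of the original exchange) that tries both candidate encodings on the first directory name. The bytes are copied from the dump; the helper name try_decode is made up for illustration.

```python
from typing import Optional

# Bytes of the first directory name, copied from the hexdump above
# (without the leading "./" and the trailing NUL).
raw = bytes.fromhex(
    "41 44 4d 49 53 53 c3 4f 20 44 45 20 46 55 4e 43 49 4f 4e 41 52 49 4f"
)

def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_utf8 = try_decode(raw, "utf-8")
as_latin1 = try_decode(raw, "iso-8859-1")

print("UTF-8:     ", as_utf8)
print("ISO-8859-1:", as_latin1)
```

In UTF-8, c3 can only be the first byte of a two-byte sequence and must be followed by a byte in the range 80 to bf. In the dump it is followed by 4f ("O"), so the UTF-8 attempt fails while the single-byte decoding succeeds. Note that ISO-8859-1 decoding never fails, so a successful decode alone does not prove that ISO-8859-1 was the original encoding.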
 

higorr

Well, I can't tell you whether it was ISO-8859-1. What I can say is that the clients that use and modify these folders are macOS and Windows clients, with Windows making up the largest share. I also took a look at the FreeNAS configuration, specifically the configuration of the Samba service, and noticed that it is set as follows.
NOTE: The users access these directories through the Samba service.
Captura de tela de 2020-01-09 15-05-51.png
 

higorr

When I tried changing the UNIX charset field from ISO-8859-1 to UTF-8, the directories with accented and special characters simply disappeared from the client interface. I got scared and went back to ISO-8859-1, and the files came back.
 

HolyK

Okay, thanks ... My plain guess is that this is actually an issue with the session rather than anything else. I assume the users of these directories see the names properly over SMB, so the main "issue" here is that you don't see them properly when connected to the server CLI.

So my next question is: how are you connecting to the server? I mean, are you connected directly to the console or via SSH? If SSH, what is the client? If PuTTY, do you have the proper charset configured?

1578594597051.png


If this is the source of the issue, I suggest changing the locale back to its original values (whatever they were) and using the same charset in your SSH client. Then you should be able to see the UTF-8 chars just fine ...

UTF-8 set in PuTTY
1578595259026.png


To demonstrate the difference, here is the same dir but with 8859-2 ... as you can see, the names are screwed up and one file is "missing"
1578595419828.png


And one more comment on the UNIX charset ... you CANNOT change the charset just like that. If you do, the charset used for "decoding" will no longer match the charset under which the files/dirs were created. If you really want to change the charset used by SMB, you have to convert the names of all the files/dirs. To do so you can use convmv, which handles the conversion. Make sure that you have a proper backup and that you know the original encoding, as wrong usage of convmv could cause wayyyy more issues.
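
To illustrate what such a conversion does to a single name, here is a rough Python sketch of a dry run. It is not how convmv is implemented; the function name plan_rename is made up for illustration, the two sample names are the ones from the hexdump earlier in the thread, and nothing is renamed on disk.

```python
from typing import Optional

def plan_rename(name: bytes, source: str = "iso-8859-1") -> Optional[bytes]:
    """Return the UTF-8 form of a file name, or None if no rename is needed.

    Names that are already valid UTF-8 (this includes plain ASCII) are left
    alone, so that a name is not converted twice.
    """
    try:
        name.decode("utf-8")
        return None
    except UnicodeDecodeError:
        pass
    # Assumption: every name that is not valid UTF-8 was written in `source`.
    return name.decode(source).encode("utf-8")

for old in (b"ADMISS\xc3O DE FUNCIONARIO", b"ADMISSAO DE ESTAGIARIO"):
    new = plan_rename(old)
    print(old, "->", new if new is not None else "(unchanged)")
```

The sketch also shows why knowing the original encoding matters: decoding as ISO-8859-1 never raises an error, so a wrong guess for the source encoding silently produces wrong characters instead of failing.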

EDIT:
Err, I forgot that the initial question is about the S3 upload. How exactly are you sending the files? Can you try it from the CLI with debug enabled? aws --debug --region=.......
 

higorr

So my next question is "How are you connecting to the server." I mean are you directly connected to console or via SSH? If SSH what is the client. If PuTTy do you have proper charset configured?
Regarding this question: I'm connecting via SSH from Ubuntu 19.04.

And one more comment for the UNIX charset ... you CAN NOT change the charset just like that. If you do so then the charset used for "decoding" would not match the charset under which the files/dirs were created. If you really want to change the charset used by SMB you have to convert all of the files/dirs encoding. To do so you can use convmv which handles conversion. Make sure that you have a proper backup and you know the original encoding as wrong usage of the convmv could cause wayyyy more issues.

Thanks for this tip. There are actually years' worth of files, possibly stored with totally different encodings, so it is risky to convert them all.
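
One way to get a feel for how mixed the stored names are, before converting anything, is to list the names that are not valid UTF-8. The following Python sketch is only an illustration of that idea, not something used in this thread; the function names are invented, and the directory to scan is passed on the command line.

```python
import os
import sys
from typing import Iterator

def is_valid_utf8(name: bytes) -> bool:
    """True if the raw file name decodes cleanly as UTF-8 (ASCII included)."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def non_utf8_paths(root: bytes) -> Iterator[bytes]:
    """Walk the tree under root and yield paths whose last component is not UTF-8."""
    # Passing bytes to os.walk makes it return raw bytes names, undecoded.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)

if __name__ == "__main__":
    # Read-only: the script only prints, it never renames anything.
    for arg in sys.argv[1:]:
        for path in non_utf8_paths(os.fsencode(arg)):
            print(path)
```

The paths are printed as Python bytes literals, so the offending bytes show up as escapes such as \xc3. A name that passes the check is not guaranteed to have been written as UTF-8, but names that fail it were certainly written in some other encoding.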
 

HolyK

About this question, I'm connecting via ssh on an ubuntu 19:04
What is the output of locale on your Ubuntu system? If you use the proper locale for the SSH connection, you should see the characters properly. That, I would say, is the first requirement for moving forward.
Or, if you don't want to mess with the locales, you could install PuTTY and set the connection charset as per the screenshot in my previous post.

Thanks for this tip, there are actually years files maybe stored with totally different encodes so it is risky to convert them all.
Well, for Linux/Unix clients it should not be an issue, as SMB converts to whatever is defined in UNIX charset, so ISO-8859-1 in your case. Windows makes things messier (as always): SMB actually tries to match the charset the client is using, while preferring UTF-8. If it fails to match (DOS/Win98/ME era), it uses ASCII (which adds even more screw-ups these days). I assume you don't have any ancient SMB clients (because backward compatibility with SMB1 is not enabled in your setup), so I really assume you have mostly the UTF-8 charset.

So first ... get PuTTY and try SSH with UTF-8. See if you can see the special chars.
Second ... try the following command while connected to your server, testparm -v | grep -i "charset", and paste the output.

Once your terminal displays the characters properly, try uploading the files to S3 from the console (aws ....). If it still fails, try debug mode (aws --debug --region=.......) and paste the output.
 

higorr

Well for Linux/Unix clients it should not be issue as SMB converts to whatever is defined in UNIX charset so ISO-8859-1 in your case. Windows makes things more messy (as always) and SMB actually tries to match the charset the client is using while preferring UTF-8. If it fails to match (DOS/Win98/ME era) it uses ASCII (which adds even more screwups for modern days). I assume you don't have any ancient SMB clients (because the back-compatibility with SMB1 is not enabled in your setup) so i really assume you have mostly UTF-8 charset.
Understood.

So first ... get PuTTy and try SSH with UTF-8. See if you can see the special chars.
Second ... try following command while connected to your server testparm -v | grep -i "charset" and paste output.

I installed PuTTY, but the names still don't display correctly.
The output of the command testparm -v | grep -i "charset" is in the image:
Captura de tela de 2020-01-23 09-44-19.png

Once you have proper character display in your terminal try uploading the files to S3 via console mode ... (aws ....). If it still fails try the debug mode (aws --debug --region=.......) and paste output.
This command does not exist here.
 

HolyK

Okay, so apparently UTF-8 is invalid here even though the GUI allows it (a BUG to be reported), and with this invalid charset it falls back to the default CP850 (which contains all of the "Latin-1" / 8859-1 chars).

The character à has HEX value "C3" in ISO8859-1 and "C3 83" in UTF-8 which is still OK for this particular one.
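
The two hex values quoted above can be reproduced with a short Python sketch (an illustration added here, with made-up variable names):

```python
# U+00C3 is "LATIN CAPITAL LETTER A WITH TILDE", written as an escape so the
# example does not depend on the encoding of the source file itself.
char = "\u00c3"

as_latin1 = char.encode("iso-8859-1")
as_utf8 = char.encode("utf-8")

print("ISO-8859-1:", as_latin1.hex())
print("UTF-8:     ", as_utf8.hex())

# Reading the UTF-8 bytes back as ISO-8859-1 yields two characters, not one.
misread = as_utf8.decode("iso-8859-1")
print("UTF-8 bytes read as ISO-8859-1 give", len(misread), "characters")
```

The same character is one byte in ISO-8859-1 and two bytes in UTF-8, which is why a hexdump of the file name is enough to tell the two encodings apart.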

1579791832892.png
1579791867223.png


So it seems to me that the locale is just screwed up...

Let's get some details about your env (I know some of these were already provided, but still):
- Open a terminal on your PC, type locale, and paste the output
- Connect to the server via SSH, run locale again, and paste the output
- While you're connected, check the file names - I assume they're invalid.
- Try to type some special characters (like that "Ã") into the terminal. Do they display properly, or do you see different chars (or just "dots" or "nothing")?
- Open PuTTY, set the Remote character set to ISO-8859-1, and connect to the server
- Check the file names ... are they OK, still broken (the same way as before), or broken but with different characters?
- Again try to type some of these special characters. Does it work?
- Close the session, open PuTTY again, but this time set the encoding to UTF-8 and try the same again.

Post results ...

Note: aws is a command-line program for S3. How exactly are you uploading the files to S3?
 

higorr

- Open terminal on your PC and type locale, paste output
Connect to the server via SSH and again use locale, paste output

I get identical output:

Code:
LANG=pt_BR.UTF-8
LC_CTYPE="pt_BR.UTF-8"
LC_COLLATE="pt_BR.UTF-8"
LC_TIME="pt_BR.UTF-8"
LC_NUMERIC="pt_BR.UTF-8"
LC_MONETARY="pt_BR.UTF-8"
LC_MESSAGES="pt_BR.UTF-8"
LC_ALL=



While you're connected check the file names - i assume they're invalid.
Yes, they show question marks.

Try to write some special characters (like that "Ã") into the terminal. Are they displaying properly or you see different chars (or just "dots" or "nothing")
Yes, when I type over SSH I can see these characters.

- Open PuTTY, set the Remote character set to ISO-8859-1, connect to server
- Check the file names .. are OK, or still broken (the same way like before) or broken but by a different characters.
I get the same problem when viewing the folder names.


- Again try to type some of these special characters. Does it work?
Yes, it worked; I can type normally, just as with UTF-8.

- Close the session, open PuTTy again but this time set the encode to UTF-8 and try again the same.
Same result: I can type characters like Ã, but I still can't see the folder names properly. I get a question mark " ? " where these characters should be.
 

HolyK

So I was playing with this for a while and I guess I've replicated your scenario:

SSH with ISO-8859-1
root@freenas[/tmp/test1]# locale
LANG=pt_PT.ISO8859-1
LC_CTYPE="pt_PT.ISO8859-1"
LC_COLLATE="pt_PT.ISO8859-1"
LC_TIME="pt_PT.ISO8859-1"
LC_NUMERIC="pt_PT.ISO8859-1"
LC_MONETARY="pt_PT.ISO8859-1"
LC_MESSAGES="pt_PT.ISO8859-1"
LC_ALL=
root@freenas[/tmp/test1]# touch testÃtest
root@freenas[/tmp/test1]# ls -la | grep test
-rw-r--r-- 1 root wheel 0 24 jan 00:36 testÃtest
root@freenas[/tmp/test1]# find . -name "test*" -type f -print0 | hexdump -C
00000000 2e 2f 74 65 73 74 c3 74 65 73 74 00 |./testÃtest.|

Then it looks like this under SSH with UTF-8 AND with the locale set to en_US.UTF-8:

root@freenas[/tmp/test1]# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
root@freenas[/tmp/test1]# ls -la | grep test
-rw-r--r-- 1 root wheel 0 Jan 24 00:36 test?test
// Seems like your issue, right ...


Just for comparison....
SSH with UTF-8 and a UTF-8 locale
root@freenas[/tmp/test1]# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
root@freenas[/tmp/test1]# touch testÃtest
root@freenas[/tmp/test1]# find . -name "test*" -type f -print0 | hexdump -C
00000000 2e 2f 74 65 73 74 c3 83 74 65 73 74 00 |./test..test.|
// The filename is in UTF-8, so the HEX value for "Ã" is "c3 83"
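
The effect seen in the two sessions can be approximated in Python. This is only a sketch under the assumption that a terminal simply decodes the raw name bytes with its own charset. The function name shown_as is invented, and Python marks undecodable bytes with U+FFFD where ls prints "?".

```python
def shown_as(name: bytes, terminal_encoding: str) -> str:
    """Approximate what a terminal using the given encoding would display."""
    return name.decode(terminal_encoding, errors="replace")

latin1_name = b"test\xc3test"      # as created in the ISO8859-1 session
utf8_name = b"test\xc3\x83test"    # as created in the UTF-8 session

for label, name in (("latin1 name", latin1_name), ("utf8 name", utf8_name)):
    for encoding in ("iso-8859-1", "utf-8"):
        # repr() keeps control characters and U+FFFD visible in the output.
        print(f"{label:12} viewed as {encoding:10}: {shown_as(name, encoding)!r}")
```

Each name only reads correctly when the viewing charset matches the charset it was created with. The Latin-1 name viewed as UTF-8 gets a replacement marker, and the UTF-8 name viewed as Latin-1 turns into Ã followed by a stray control character.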


So I would try this:
- Change ~/.login_conf to this:
Code:
me:\
    :charset=ISO-8859-1:\
    :lang=pt_PT.ISO8859-1:

- Then run cap_mkdb ~/.login_conf
- Log out of the session
- Connect to the server via PuTTY with "Remote character set" set to "ISO-8859-1"
1579823308565.png

- Validate that locale gives you "pt_PT.ISO8859-1" for all variables.
- Check that you can now see the filenames properly
 

higorr

I did as you guided me, setting ISO8859-1 and connecting via PuTTY. I went through all the steps and checked the locale, and I get this:
novoooo.png


And when I connect to FreeNAS via SSH from my terminal, I get this:

Code:
[hlogin@freenas /mnt/file/tour]$ ls | grep REL*| hexdump -C
00000000  52 45 4c 41 c7 41 4f 20  44 45 20 46 55 4e 43 49  |RELA�AO DE FUNCI|
00000010  4f 4e 41 52 49 4f 53 0a                           |ONARIOS.|
00000018
 

HolyK

@higorr Wheeee WTF ... okay, I have two more things to try...

Assuming you now have "pt_PT.ISO8859-1" set as your locale on the server, follow the steps below
- Connect via PuTTY with ISO-8859-1 as the remote charset and do:
Code:
locale
mkdir /tmp/test
cd /tmp/test/
touch `echo -e "1test\xc3test"`
touch `echo -e "2test\xc3\x83test"`
ls -la

- Copy (or screenshot) the full console output here (not only the results but the whole terminal screen, including the commands entered)
- Change the content of ~/.login_conf to this:
Code:
me:\
    :charset=UTF-8:\
    :lang=en_US.UTF-8:

- Regenerate the locale settings by running cap_mkdb ~/.login_conf
- KEEP the first connection open and connect with another PuTTY instance, this time with UTF-8 as the remote charset, and do:
Code:
locale
cd /tmp/test/
touch `echo -e "3test\xc3test"`
touch `echo -e "4test\xc3\x83test"`
ls -la

- Again paste the output/screenshot of the terminal here
- Now go back to the first PuTTY instance (which still has the 8859-1 locale) and, while in /tmp/test, run ls -la again and paste that output here as well.

The result should be that 1 and 3 are OK in the Latin-1 environment while 2 and 4 are "broken". The second PuTTY session with UTF-8 should show the names of 2 and 4 properly, while 1 and 3 will be scrambled. It works flawlessly on my end, but let's see your output.

Anyway, in the second PuTTY window (the one with the UTF-8 charset) do this:
Code:
convmv -f ISO-8859-1 -t UTF-8 1tes*

It gives you this kind of output:
1579969406669.png

Now call it with the --notest flag
Code:
convmv -f ISO-8859-1 -t UTF-8 1tes* --notest

The output will confirm that the conversion was done. Now run ls -la and paste the output. You should see 1, 2 and 4 properly, while 3 will still be scrambled.
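
The expected end state can be checked on paper with a small Python model of the four test names. This is a sketch only: the convmv run is modelled as a plain re-encode of name 1, which is an assumption about its effect rather than a description of how convmv works internally, and the helper is_utf8 is made up.

```python
def is_utf8(name: bytes) -> bool:
    """True if the raw name decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Raw names as produced by the four touch commands in the steps above.
names = {
    "1": b"1test\xc3test",
    "2": b"2test\xc3\x83test",
    "3": b"3test\xc3test",
    "4": b"4test\xc3\x83test",
}

# Model of: convmv -f ISO-8859-1 -t UTF-8 1tes* --notest
names["1"] = names["1"].decode("iso-8859-1").encode("utf-8")

readable_in_utf8 = sorted(key for key, raw in names.items() if is_utf8(raw))
print("Readable in the UTF-8 session:", readable_in_utf8)
```

Only name 3 keeps the lone c3 byte, so it is the one that should still look scrambled in the UTF-8 window.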

Waiting for your results :]
 