Length and Size of String in Encoding - C# #153

ShervanN · 2022-11-20T11:21:28Z


using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.Unicode;

class Sample
{
    public static void My(string text, Encoding encoding)
    {
        Console.WriteLine(text);
        encoding = Encoding.UTF8;
        byte[] bytes = encoding.GetBytes(text);
        Console.WriteLine($"GetBytesCount: {encoding.GetByteCount(text)}");
        Console.WriteLine($"Length: {text.Length}");
        for (int i = 0; i < bytes.Length; i++)
        {
            Console.Write($"{bytes[i],-7}");
        }
        Console.WriteLine();
        Console.WriteLine($"Preamble: {encoding.GetPreamble().Length}");
        byte[] bytesPreamble = encoding.GetBytes(text);
        for (int i = 0; i < encoding.GetPreamble().Length; i++)
        {
            Console.Write($"{bytesPreamble[i],-7}");
        }
        Console.WriteLine("\n--------\n");
    }
    public static void Main()
    {
        string text = "My🡪";

        Console.WriteLine("ASCII");
        My(text, Encoding.ASCII);

        Console.WriteLine("Unicode");
        My(text, Encoding.Unicode);

        Console.WriteLine("UTF-8");
        My(text,Encoding.UTF8);

        Console.WriteLine("Default");
        My(text, Encoding.Default); 

        Console.WriteLine("UTF-32");
        My(text, Encoding.UTF32);
    }
}

I wrote the code above.

I have three questions:

1. Could you please explain how the byte sequence 240 159 161 170 was calculated for the arrow (->)?
2. I read that ASCII is just 8-bit and doesn't cover many symbols, emojis, and so on. But in my code, ASCII gives the same result as the other encodings and was able to encode ->. Could you please explain this?
3. Why is GetBytesCount 6? Is that the size? A char in C# is 2 bytes, so if this is the size of "My->", it should be 8 bytes. I'm confused!

(I cannot upload an image here, so I uploaded it to an image host. Please see the image below from my console app.)
https://ibb.co/sKJhrnV

Thanks in advance


christiannagel · 2022-11-20T18:37:07Z

Maintainer

With this line

        encoding = Encoding.UTF8;

you overwrite the encoding passed to the method, so UTF-8 is always used. If you remove this line, you'll get different results for each encoding.

The arrow in the string you pass is a single character: 🡪 is one character, whereas -> is two characters. UTF-8 encodes M and y as one byte each and 🡪 as four bytes, which is why you see 6 bytes.
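To make the difference between char count, character count, and byte count concrete, here is a minimal sketch. The class name EncodingSizes and the helper CodePointCount are made up for this illustration and are not part of the original code; the arrow is written as the escape \U0001F86A so the example does not depend on the source file's encoding.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Hypothetical helper: counts Unicode code points instead of UTF-16 chars.
    public static int CodePointCount(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A high surrogate followed by a low surrogate is one code point.
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string text = "My\U0001F86A"; // same content as "My🡪"

        Console.WriteLine($"Length (UTF-16 chars): {text.Length}");
        Console.WriteLine($"Code points: {CodePointCount(text)}");

        // The encoding is used as passed in; it is not overwritten.
        var encodings = new Encoding[]
        {
            Encoding.ASCII, Encoding.UTF8, Encoding.Unicode, Encoding.UTF32
        };
        foreach (Encoding encoding in encodings)
        {
            Console.WriteLine($"{encoding.WebName}: {encoding.GetByteCount(text)} bytes");
        }
    }
}
```

For this string, Length is 4 while the code point count is 3, and the byte counts should come out as 4 (ASCII), 6 (UTF-8), 8 (UTF-16), and 12 (UTF-32).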

1 reply

ShervanN Nov 21, 2022
Author

Thank you. I know you are busy, but do you have time to explain the output? That would be kind of you.
I am confused: I cannot understand how these numbers are calculated for the different encodings in C#.
Please see the image I uploaded to the image host.
https://ibb.co/YDMzsXC

christiannagel · 2022-11-21T19:30:19Z

Maintainer

When you write the preamble to the console, you don't access the bytes returned from encoding.GetPreamble(). Instead, you access the converted bytes. The GetBytes method doesn't add a preamble.
You can add the preamble yourself if you want to use it. For single strings you don't really need to do this. It's useful as a prefix for a complete text that you store on disk or send across the network, so the encoding can easily be detected when the text is read again, also on other platforms.

In Chapter 18, Files and Streams (section Analyzing Text File Encodings on page 497), you can read about the different encodings and the reason for these byte order marks (BOM).
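As an illustration of adding the preamble yourself, the following sketch prints the bytes from GetPreamble() separately and then prepends them to the encoded text. The names PreambleDemo and EncodeWithPreamble are invented for this example; only GetPreamble, GetBytes, and Buffer.BlockCopy come from .NET.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Hypothetical helper: returns the preamble (BOM) followed by the encoded text.
    public static byte[] EncodeWithPreamble(string text, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble(); // empty for encodings without a BOM
        byte[] body = encoding.GetBytes(text);    // never contains the preamble

        byte[] result = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, result, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, result, preamble.Length, body.Length);
        return result;
    }

    public static void Main()
    {
        string text = "My\U0001F86A"; // same content as "My🡪"
        var encodings = new Encoding[]
        {
            Encoding.ASCII, Encoding.UTF8, Encoding.Unicode, Encoding.UTF32
        };

        foreach (Encoding encoding in encodings)
        {
            // Read the preamble bytes from GetPreamble(), not from GetBytes(text).
            Console.WriteLine($"{encoding.WebName} preamble: {string.Join(" ", encoding.GetPreamble())}");
            Console.WriteLine($"{encoding.WebName} with preamble: {string.Join(" ", EncodeWithPreamble(text, encoding))}");
        }
    }
}
```

Encoding.ASCII has an empty preamble, so its two lines differ only in the text bytes. For Encoding.UTF8 the preamble is the three bytes 239 187 191 (0xEF 0xBB 0xBF).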

What you now get with ASCII is 4 bytes for the 4 chars of the string. M is translated to the ASCII capital letter M (77, 0x4D). y is translated to the small letter y (121, 0x79). The Unicode symbol 🡪 cannot be represented in ASCII. It takes up two chars in the string, and you get two question marks (63, 0x3F) for it.
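The silent replacement with question marks can be made visible with a small sketch like the one below. The class name AsciiFallbackDemo and the helper TryEncodeStrict are assumptions for this example; the strict variant uses Encoding.GetEncoding with EncoderFallback.ExceptionFallback so that unrepresentable characters raise an exception instead of turning into ?.

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    // Hypothetical helper: returns false if the text cannot be encoded as ASCII
    // without replacing characters.
    public static bool TryEncodeStrict(string text, out byte[] bytes)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            bytes = strictAscii.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = Array.Empty<byte>();
            return false;
        }
    }

    public static void Main()
    {
        string text = "My\U0001F86A"; // same content as "My🡪"

        // Default behavior: characters outside ASCII are replaced by '?' (63).
        byte[] replaced = Encoding.ASCII.GetBytes(text);
        Console.WriteLine(string.Join(" ", replaced)); // 77 121 63 63

        // Strict behavior: the arrow is reported instead of being replaced.
        bool ok = TryEncodeStrict(text, out _);
        Console.WriteLine($"Encodable as ASCII without loss: {ok}");
    }
}
```

Decoding the replaced bytes gives "My??", so the arrow is lost for good once the text has been encoded this way.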

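With the Unicode (UTF-16) encoding, 8 bytes are returned, which matches two bytes for each char in the string. The first two bytes, 77 0 (0x4D 0x00, the little-endian form of 0x004D), are M. UTF-16 originally used 2 bytes for every character. However, it turned out that 2 bytes are not always enough, so one character can now take 2 bytes or 4 bytes. 🡪 is one of the 4-byte characters (a surrogate pair), which is why text.Length is 4 and not 3.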
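The 4-byte case in UTF-16 can be worked out by hand. The sketch below assumes the arrow is the code point U+1F86A (which is what the UTF-8 bytes 240 159 161 170 decode to) and splits it into its two surrogate chars. SurrogateDemo and ToSurrogates are illustrative names, not part of .NET; char.ConvertToUtf32 is the built-in call used to read the code point.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Hypothetical helper: splits a code point above U+FFFF into a surrogate pair.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        int offset = codePoint - 0x10000;               // 20 bits remain
        char high = (char)(0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return (high, low);
    }

    public static void Main()
    {
        string arrow = "\U0001F86A"; // same content as "🡪"
        int codePoint = char.ConvertToUtf32(arrow, 0);
        var (high, low) = ToSurrogates(codePoint);

        // Prints: U+1F86A surrogates: D83E DC6A
        Console.WriteLine($"U+{codePoint:X} surrogates: {(int)high:X4} {(int)low:X4}");

        // Encoding.Unicode is little-endian, so the low byte of each char comes first.
        // Prints: 77 0 121 0 62 216 106 220
        Console.WriteLine(string.Join(" ", Encoding.Unicode.GetBytes("My" + arrow)));
    }
}
```

The last four bytes, 62 216 106 220, are the two surrogates 0xD83E and 0xDC6A written low byte first.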

With UTF-8 you get one to four bytes for each character. That's why 6 bytes are returned for this string. Check the codepage layout of UTF-8.
77 (0x4D) and 121 (0x79) are the same as with ASCII. 240 (0xF0) is a leading byte that announces a 4-byte sequence, so it must be followed by three continuation bytes: 159 (0x9F) is the first, 161 (0xA1) the second, and 170 (0xAA) the third. Together these 4 bytes represent one character.
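To show how 240 159 161 170 comes about, here is a hand-written sketch of the UTF-8 bit layout. It assumes the arrow is the code point U+1F86A and is only meant for illustration: Utf8Demo and EncodeCodePoint are made-up names, the method does not validate its input (surrogates or values above U+10FFFF), and real code should simply call Encoding.UTF8.GetBytes.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // Hypothetical helper: encodes one code point following the UTF-8 bit layout.
    public static byte[] EncodeCodePoint(int cp)
    {
        if (cp < 0x80)      // 1 byte:  0xxxxxxx
            return new[] { (byte)cp };
        if (cp < 0x800)     // 2 bytes: 110xxxxx 10xxxxxx
            return new[] { (byte)(0xC0 | (cp >> 6)), (byte)(0x80 | (cp & 0x3F)) };
        if (cp < 0x10000)   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new[]
            {
                (byte)(0xE0 | (cp >> 12)),
                (byte)(0x80 | ((cp >> 6) & 0x3F)),
                (byte)(0x80 | (cp & 0x3F))
            };
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new[]
        {
            (byte)(0xF0 | (cp >> 18)),           // leading byte
            (byte)(0x80 | ((cp >> 12) & 0x3F)),  // first continuation byte
            (byte)(0x80 | ((cp >> 6) & 0x3F)),   // second continuation byte
            (byte)(0x80 | (cp & 0x3F))           // third continuation byte
        };
    }

    public static void Main()
    {
        // Prints: 240 159 161 170
        Console.WriteLine(string.Join(" ", EncodeCodePoint(0x1F86A)));

        // The built-in encoder produces the same bytes for the arrow.
        Console.WriteLine(string.Join(" ", Encoding.UTF8.GetBytes("\U0001F86A")));
    }
}
```

Written in binary, U+1F86A is 000 011111 100001 101010. Each group is placed behind the prefix 11110 or 10, which gives 0xF0, 0x9F, 0xA1, and 0xAA.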

1 reply

ShervanN Jan 8, 2023
Author

Thank you
