Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to a path interface instead of using std::string for file paths #51

Open
ni-nkozlowski opened this issue Dec 11, 2020 · 5 comments · May be fixed by #53
Open

Switch to a path interface instead of using std::string for file paths #51

ni-nkozlowski opened this issue Dec 11, 2020 · 5 comments · May be fixed by #53
Assignees

Comments

@ni-nkozlowski
Copy link
Contributor

On Windows, conversions between std::string and path interfaces results in UTF-8 strings being converted to ANSI, resulting in non-ANSI characters being lost in the conversion process. This results in files not being loaded correctly because they'll fail our valid path checks.

@marcrambo
Copy link
Collaborator

AFAIK these utf8 conversion problems are all fixed already. We added some extensive tests to ensure files with unicode characters can be loaded. The folder ni-media/audiostream/test/test_files/user_files contains all sorts of files with unicode characters... and the unit tests loads these files as expected

'L1b_mp3-123456789äöü□'$'\204''□'$'\226''□'$'\234''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.mp3'
'L1m4a-123456789äöü□'$'\204''□'$'\226''□'$'\234''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'').m4a'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.aac.m4a'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.aiff'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.alac.m4a'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.flac'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.mp3'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.ogg'
'unicode123456789äöü□'$'\204''□'$'\226''□'$'\234''□'$'\237''!@#$%^&_(□'$'\203\225''□'$'\202''□k□'$'\202''□□'$'\201\214'')-ah7Hfds.wav'

Perhaps the call site is missing the proper locale initialization? See https://github.com/NativeInstruments/ni-media/blob/master/audiostream/test/ni/media/test_helper.cpp#L46. This is needed to properly interface with windows APIs (which use utf16 encoded wchar_t). Deep down in boost::iostreams, the sources / sinks will convert char <-> wchar_t and expect the proper conversion to take place.

I am not sure switching to std::filesystem::path will change anything to that behavior.

Maybe we should document this as a requirement on Windows? Or maybe we should use our own facet inside of ni-media, but I don't know if this even works, and if it's a good idea. Tbh I lost a bit track about the current state of utf8 in cpp, and what's best practice. Actually, I just noticed that https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16 is already deprecated in C++17... that didn't last long :)

@ni-nkozlowski
Copy link
Contributor Author

CC @C4rsten

Perhaps the call site is missing the proper locale initialization?

Yes. This is how we fixed it on the user side. If we use our own facet, it's one less detail the user has to worry about, so I would be up for that. Though it seems like we would have to roll our own, as you pointed out std::codecvt has already been deprecated. 🤷

There's a PR open right now related to this issue as well where a path interface was added, but it also swaps out the backend from using boost::filesystem to std::filesystem.

@ni-nkozlowski ni-nkozlowski linked a pull request Mar 29, 2021 that will close this issue
@marcrambo
Copy link
Collaborator

Yes I agree, this is a hidden requirement and totally not obvious for clients. I would treat this issue separately from the C++ 17 std::filesystem support though.

I would suggest the following approach:

  1. We add custom utf8 <-> utf16 conversion functions inside of ni-media and handle all the path <-> string conversions ourselves -> Clients will not need to imbue boost::filesystem globally anymore. However they will still be responsible for providing utf8 encoded path strings. Internally we will continue using boost::filesystem::path. ni-media will still be c++14 and clients stuck on that compiler version also benefit from this fix.
  2. We switch from boost::filesystem::path to std::filesystem::path internally. This issue might block us Make boost::iostreams::detail::path constructible from std::filesystem::path boostorg/iostreams#110 from replacing it entirely
  3. We add the std::filesystem::path to the public interfaces (What Frank already started in Bump to C++17 for std::filesystem #53)

@franklange
Copy link

Heyhey, thanks for looking into this!

I would like to help execute this but I'm not yet 100% clear on how these steps would look like exactly.

For Step 1 you're envisioning something like this? (really not sure about the formatting here)

class aiff_file_source : public aiff_source<boost::iostreams::file_descriptor_source>
{
    using base_type = aiff_source<boost::iostreams::file_descriptor_source>;
#if BOOST_OS_WINDOWS
    using converter = std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>;
#endif

public:
    explicit aiff_file_source( const std::string& path )
#if BOOST_OS_WINDOWS
    : base_type( converter{}.from_bytes(path), BOOST_IOS::binary | BOOST_IOS::in )
    {}
#else
    : base_type( path, BOOST_IOS::binary | BOOST_IOS::in )
    {}
#endif
};

Step 2 is then using fs::path internally:

class aiff_file_source : public aiff_source<boost::iostreams::file_descriptor_source>
{
    using base_type = aiff_source<boost::iostreams::file_descriptor_source>;

public:
    explicit aiff_file_source( const std::string& path )
    : base_type( std::filesystem::u8path(path), BOOST_IOS::binary | BOOST_IOS::in )
    {}
};

and Step 3 is then just using fs::path as the input param type as done in the PR but without using u8path.

Is that somewhat along the lines of what you had in mind or am I completely going in the wrong direction?

@C4rsten
Copy link

C4rsten commented Apr 6, 2021

As stated initially I would avoid string conversions entirely. Instead on Windows use std::wstring and deprecate the std::string interface to not break compatibility. I agree with the following steps: adding std::filesystem support eventually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants