Regression - RepresenterError on dict subclass #142

Open
jaraco opened this issue Mar 12, 2018 · 11 comments

@jaraco

jaraco commented Mar 12, 2018

Due to #126, I'm running pyyaml at master. When I try to run code that passes on released versions of pyyaml, it fails in master:

-> raise RepresenterError("cannot represent an object", data)
(Pdb) type(data)
<class 'pmxbot.dictlib.ConfigDict'>
(Pdb) isinstance(data, dict)
True

As the object is a subclass of dict, it would previously be treated as a dict for the purpose of serialization. Is this an intentional change?

@jaraco
Author

jaraco commented Mar 12, 2018

Here's a more direct example of what's newly failing:

$ .tox/python/bin/python
Python 3.7.0b2 (tags/v3.7.0b2:b0ef5c979b, Feb 27 2018, 20:38:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> class mydict(dict):
...   pass
...
>>> import yaml
>>> yaml.dump(mydict(a=1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jaraco/Dropbox/code/yg/public/pmxbot/.tox/python/lib/python3.7/site-packages/yaml/__init__.py", line 217, indump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/Users/jaraco/Dropbox/code/yg/public/pmxbot/.tox/python/lib/python3.7/site-packages/yaml/__init__.py", line 196, indump_all
    dumper.represent(data)
  File "/Users/jaraco/Dropbox/code/yg/public/pmxbot/.tox/python/lib/python3.7/site-packages/yaml/representer.py", line 26, in represent
    node = self.represent_data(data)
  File "/Users/jaraco/Dropbox/code/yg/public/pmxbot/.tox/python/lib/python3.7/site-packages/yaml/representer.py", line 57, in represent_data
    node = self.yaml_representers[None](self, data)
  File "/Users/jaraco/Dropbox/code/yg/public/pmxbot/.tox/python/lib/python3.7/site-packages/yaml/representer.py", line 229,in represent_undefined
    raise RepresenterError("cannot represent an object", data)
yaml.representer.RepresenterError: ('cannot represent an object', {'a': 1})

@jaraco
Author

jaraco commented Mar 12, 2018

Here's the previous behavior, observed on Python 3.6 and PyYAML 3.12:

$ python3.6 -m rwt pyyaml==3.12
Collecting pyyaml==3.12
  Using cached PyYAML-3.12.tar.gz
Installing collected packages: pyyaml
  Running setup.py install for pyyaml ... done
Successfully installed pyyaml-3.12
Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
>>> class MyDict(dict): pass
...
>>> yaml.dump(MyDict(a=1))
'!!python/object/new:__main__.MyDict\ndictitems: {a: 1}\n'

@pradyunsg

This might be because master seems to have switched to safe dumping by default.

https://github.com/yaml/pyyaml/blob/master/lib3/yaml/dumper.py#L27

@jaraco
Author

jaraco commented Jun 6, 2018

So... does that mean the next release will be backward-incompatible? Can you recommend a workaround that would retain the 3.12 behavior?

@pradyunsg

I'm not a maintainer. Landed here from the pmxbot issue. :)

I'd guess you could try to get the "DangerousDumper" (or similarly named) class and, if it's not present, fall back to the existing Dumper class when doing the dump.

That'll work now and in the future, even if the next release is backwards incompatible.
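
A rough sketch of that fallback ("DangerousDumper" is a hypothetical name taken from the suggestion above; current releases only expose yaml.Dumper as the non-safe dumper):

import yaml

# Prefer a hypothetical "DangerousDumper" if a future release exposes one,
# otherwise fall back to the plain Dumper that exists today.
dumper_cls = getattr(yaml, "DangerousDumper", yaml.Dumper)
text = yaml.dump({"a": 1}, Dumper=dumper_cls)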


IMO, the current behaviour on master is the correct behaviour. By default PyYAML should be using the "safe" dumper and loader.

@jaraco
Author

jaraco commented Oct 19, 2019

I tried using the dangerous dumper, but found that it creates a YAML file that requires a dangerous loader. What I really want is to instruct YAML to serialize a dict subclass exactly as it would a dict.

@dhimmel

dhimmel commented Nov 24, 2020

I also ran into this issue trying to dump an instance of a dict subclass.

I found an ugly workaround on StackOverflow that you must run before calling yaml.dump:

yaml.add_representer(
    Dict_Subclass,
    lambda dumper, data: dumper.represent_mapping('tag:yaml.org,2002:map', data.items())
)

Limitations:

  1. Dict_Subclass must be the specific dict subclass. This doesn't fix the issue for all dict subclasses, just those for which a representer is explicitly registered.
  2. It works with yaml.dump but not yaml.safe_dump (see the sketch after this list for a variant that also covers safe_dump).
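
For the second limitation, a variant that also covers yaml.safe_dump is to register the representer on the SafeDumper (DictSubclass below is a stand-in name):

import yaml

class DictSubclass(dict):  # stand-in for the actual subclass
    pass

# Same idea as the workaround above, but registered on SafeDumper so that
# yaml.safe_dump also emits the subclass as a plain, untagged mapping.
yaml.add_representer(
    DictSubclass,
    lambda dumper, data: dumper.represent_dict(data),
    Dumper=yaml.SafeDumper,
)

print(yaml.safe_dump(DictSubclass(a=1)))  # a: 1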

@bluenote10

bluenote10 commented Apr 18, 2023

This issue doesn't seem to be limited to dict, but applies to all types that are subclasses of representable types, right? Consider for instance:

import json
import yaml

class StringSubclass(str):
    ...

class IntSubclass(int):
    ...

Both of these fail:

print(yaml.safe_dump(StringSubclass("foo")))
print(yaml.safe_dump(IntSubclass(42)))

This is unfortunate, because it is inconsistent with the behavior of json.dumps, which can serialize such types as long as their base class is serializable. So these two work just fine:

print(json.dumps(StringSubclass("foo")))
print(json.dumps(IntSubclass(42)))

The difference in behavior is unfortunate, because it means that (untagged, "safe") yaml serialization cannot simply be used in places where plain json serialization works fine. IMHO, safe dumping should mirror the behavior of json here and allow serialization whenever the base type is serializable anyway. Otherwise, the workaround needed on the user side when switching from json to yaml can get pretty ugly (a sketch follows).
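
For illustration, here is roughly what that per-type workaround looks like today, registering each subclass explicitly on the safe dumper:

import yaml
from yaml.representer import SafeRepresenter

class StringSubclass(str):
    ...

class IntSubclass(int):
    ...

# Every subclass has to be registered explicitly; plain (non-multi)
# representers do not fall back along the MRO.
yaml.SafeDumper.add_representer(StringSubclass, SafeRepresenter.represent_str)
yaml.SafeDumper.add_representer(IntSubclass, SafeRepresenter.represent_int)

print(yaml.safe_dump(StringSubclass("foo")))
print(yaml.safe_dump(IntSubclass(42)))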

@nitzmahone
Member

Yep, I've gotten nailed by this myself plenty of times. In addition to the stuff going on in native Python, IIRC the Cython-libyaml CDumper also has some hard-coded type checks that would need to be re-thought in order to enable this kind of fallback behavior. At least for the "Python-aware" dumpers, we'd probably also have to have this kind of thing be an opt-in behavior, or make some changes and add tests to ensure that more-specific types are always selected first (so that folks that are expecting actual round-trippable serialization of arbitrary Python types still get the right type tagged on the output).

This all probably gets a little easier to manage once we merge something like #700, since it's mostly just about which types get registered as representers rather than wildly different behavior that would necessitate new Dumper subclasses.

@bluenote10

so that folks that are expecting actual round-trippable serialization of arbitrary Python types still get the right type tagged on the output

At least for the SafeDumper this wouldn't be the expected behavior, right? By definition the output of the safe dumper should be free of tags, so round trips with custom types aren't expected to work. Nonetheless, inheritance is an is-a relationship, and if a value is a string, the safe representer should represent it as such.

If we consider the main use case of the non-safe dumper to be perfect roundtrips, this dumper should probably stay as it is?

I did a quick experiment: Replacing the following occurrences of add_representer by an add_multi_representer seems to solve this issue:

SafeRepresenter.add_representer(str, SafeRepresenter.represent_str)
SafeRepresenter.add_representer(bytes, SafeRepresenter.represent_binary)
SafeRepresenter.add_representer(bool, SafeRepresenter.represent_bool)
SafeRepresenter.add_representer(int, SafeRepresenter.represent_int)
SafeRepresenter.add_representer(float, SafeRepresenter.represent_float)
SafeRepresenter.add_representer(list, SafeRepresenter.represent_list)
SafeRepresenter.add_representer(tuple, SafeRepresenter.represent_list)
SafeRepresenter.add_representer(dict, SafeRepresenter.represent_dict)
SafeRepresenter.add_representer(set, SafeRepresenter.represent_set)
SafeRepresenter.add_representer(datetime.date, SafeRepresenter.represent_date)
SafeRepresenter.add_representer(datetime.datetime, SafeRepresenter.represent_datetime)

The downside would be slightly worse performance, because the lookup would then always have to go into the second branch with the iteration here:

data_types = type(data).__mro__
if data_types[0] in self.yaml_representers:
    node = self.yaml_representers[data_types[0]](self, data)
else:
    for data_type in data_types:
        if data_type in self.yaml_multi_representers:
            node = self.yaml_multi_representers[data_type](self, data)
            break

To get back to the current performance, we could add each multi representer to the normal representers as well (i.e., add it to both dicts in line 75), so that the lookup here would again take the branch in line 48 in the majority of cases.
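
The effect of that experiment can also be approximated from user code without patching PyYAML, by registering the fallback as a multi representer (shown here only for dict):

import yaml
from yaml.representer import SafeRepresenter

# Register dict as a *multi* representer so that the MRO-based fallback in
# represent_data picks it up for arbitrary dict subclasses.
yaml.SafeDumper.add_multi_representer(dict, SafeRepresenter.represent_dict)

class MyDict(dict):
    pass

print(yaml.safe_dump(MyDict(a=1)))  # a: 1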

IIRC the Cython-libyaml CDumper also has some hard-coded type checks that would need to be re-thought in order to enable this kind of fallback behavior

Oh, I wasn't aware of that other implementation, so I'm not sure how the change suggested above would relate to it. If it could work like I suggested, may I open a PR?

@nitzmahone
Member

nitzmahone commented Apr 20, 2023

At least for the SafeDumper this wouldn't be the expected behavior, right?

Correct, but it's really a little more nuanced than that, especially in light of what's going on with #700 and YAML 1.2 support, since the entire concept of "safe" vs. "not safe" gains a few more dimensions. The problem is making sure that we stack the representers in the right order so that anything that subclasses them and/or adds further multi-representers doesn't get hidden by the fallback. #700 makes that a little more visible and explicit, but it's still probably not quite enough to prevent folks from being bitten by an overly "greedy" default fallback that masks something they added explicitly.

Feel free to play around with some ideas and impls on a draft PR if you like, but we're probably not going to actually merge anything that changes this behavior until after the YAML 1.2 support is nailed down, since it potentially introduces some new variables to consider. Just want to set expectations 😉
