Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How should we handle text_data in proto #1186

Open
duglin opened this issue Apr 6, 2023 · 39 comments
Open

How should we handle text_data in proto #1186

duglin opened this issue Apr 6, 2023 · 39 comments

Comments

@duglin
Copy link
Collaborator

duglin commented Apr 6, 2023

See: cloudevents/sdk-go#759

From that issue:


On the use of the text_data and binary_data fields in the protobuf-format spec, it says:

When the type of the data is text, the value MUST be stored in the text_data property.
datacontenttype SHOULD be populated with the appropriate media-type.
When the type of the data is binary the value MUST be stored in the binary_data property.
datacontenttype SHOULD be populated with the appropriate media-type.

The Java SDK's serializer always uses the text_data field when the datacontenttype is application/json, application/xml or text/*. Otherwise the binary_data field is used. However, in the Go SDK's deserializer binary_data is treated literally, while text_data gets serialized using the appropriate serializer for the datacontenttype. When a CloudEvent with a datacontenttype of application/json is serialized into protobuf using the Java SDK and then sent to a service using the Go SDK and deserialized there, this results in the JSON in the text_data field being interpreted as a JSON-String when it should be treated as a JSON-Object. So if I serialize this CloudEvent into protobuf using the Java SDK:

{
    "specversion" : "1.0",
    "type" : "com.example.someevent",
    "source" : "/mycontext",
    "subject": null,
    "id" : "C234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "comexampleextension1" : "value",
    "comexampleothervalue" : 5,
    "datacontenttype" : "application/json",
    "data" : {
        "appinfoA" : "abc",
        "appinfoB" : 123,
        "appinfoC" : true
    }
}

and then send the resulting protobuf object to a service using the Go SDK and deserialize it there back into JSON format, the result is:

{
    "specversion" : "1.0",
    "type" : "com.example.someevent",
    "source" : "/mycontext",
    "subject": null,
    "id" : "C234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "comexampleextension1" : "value",
    "comexampleothervalue" : 5,
    "datacontenttype" : "application/json",
    "data" : "{ \"appinfoA\" : \"abc\", \"appinfoB\" : 123, \"appinfoC\" : true }"
}

This seems to be unintended behavior, but I as far as I can tell, neither the Java nor the Go SDK implementations of the Protobuf format are technically wrong according to the way the spec is written, so I decided to raise this issue here to get clarification. Is this a bug in one or both of these libraries? Should the Protobuf format spec be more specific in how the binary_data and text_data fields should be used? Or is this behavior somehow intended?

@JemDay FYI

@duglin
Copy link
Collaborator Author

duglin commented Apr 6, 2023

This was also discussed on the 2022-07-28 call: https://www.youtube.com/watch?v=cg0jKFwyJeM&list=PLO-qzjSpLN1BEyKjOVX_nMg7ziHXUYwec&index=34

@JemDay
Copy link
Contributor

JemDay commented Apr 6, 2023

I think this is where Mr. @jskeet conformance tests are going to come in useful :-)

As you mention the behavior of the Go SDK is a bit odd, but potentially not incorrect - until you read the fine-print in the JSON Format specification. ;-)

If the datacontenttype declares the data to contain JSON-formatted content, a JSON serializer MUST translate the data value to a JSON value, and use the member name data to store it inside the JSON representation. The data value MUST be stored directly as a JSON value, rather than as an encoded JSON document represented as a string. An implementation MAY fail to serialize the event if it is unable to translate the runtime value to a JSON value.

Given that statement I'd humbly suggest that the Go SDK is not spec compliant, whether all the other SDKs are compliant in this scenario is another question entirely...

@JemDay
Copy link
Contributor

JemDay commented Apr 6, 2023

The bottom line here seems to be that if your application code hands a 'string' to the SDK along with the assertion that it is JSON then the SDK its required to parse the string back into a JSON value and stick that in the data property.

If the parse step fails then the SDK should error-out.

@duglin
Copy link
Collaborator Author

duglin commented Apr 20, 2023

@lionelvillard if we "fixed" the go-lang SDK so that we don't treat this as a string do you think this will break lots of Knative folks?

@duglin
Copy link
Collaborator Author

duglin commented Apr 20, 2023

I might be doing it wrong, but in testing the golang SDK I'm seeing this:

Original EVENT:
{
  "data": {
    "Hello": "there"
  },
  "datacontenttype": "application/json",
  "id": "",
  "source": "example/uri",
  "specversion": "1.0",
  "type": "example.type"
}

What HTTP sees:
{
  "data": {
    "Hello": "there"
  },
  "datacontenttype": "application/json",
  "id": "c6c142aa-3943-42de-96ad-1186e39a1cb6",
  "source": "example/uri",
  "specversion": "1.0",
  "time": "2023-04-20T20:28:53.710827689Z",
  "type": "example.type"
}

Receiving it and serializing it back as JSON:
{
  "data": {
    "Hello": "there"
  },
  "datacontenttype": "application/json",
  "id": "c6c142aa-3943-42de-96ad-1186e39a1cb6",
  "source": "example/uri",
  "specversion": "1.0",
  "time": "2023-04-20T20:28:53.710827689Z",
  "type": "example.type"
}

I think it's working as expected. Perhaps something changed since the original issue was opened.

@jskeet
Copy link
Contributor

jskeet commented Apr 21, 2023

@duglin: I believe the point is that it's probably fine if the SDK parses something that already has a JSON object as the data value, because it presumably preserves that as a JSON object. But if it's just given a string (e.g. because it's transcoding from the proto format) it doesn't try to convert that into a JSON object, even if the content type is application/json.

Here's C# code (using the System.Text.Json formatter, but I'm sure the Newtonsoft.Json one will behave the same way) showing the issue in C#:

using CloudNative.CloudEvents;
using CloudNative.CloudEvents.SystemTextJson;
using System.Text;

var evt = new CloudEvent
{
    Id = "Issue1186",
    Type = "example.type",
    Source = new Uri("example/uri", UriKind.RelativeOrAbsolute),
    Data = "{\"Hello\": \"there\"}",
    DataContentType = "application/json"
};

var formatter = new JsonEventFormatter(new() { WriteIndented = true }, new());
var bytes = formatter.EncodeStructuredModeMessage(evt, out _);
Console.WriteLine(Encoding.UTF8.GetString(bytes.Span));

Output:

{
  "specversion": "1.0",
  "id": "Issue1186",
  "type": "example.type",
  "source": "example/uri",
  "datacontenttype": "application/json",
  "data": "{\u0022Hello\u0022: \u0022there\u0022}"
}

It would be good to know what other SDKs do in the same scenario.

Fundamentally I think this is a problem with model we have of data within our formats and SDKs - each format is opinionated in its own way, violating the model of "data is just a sequence of bytes". That does make for much more readable serialized events - at the cost of this sort of thing. I really don't know where we should go from here.

@duglin
Copy link
Collaborator Author

duglin commented Apr 21, 2023

That example I used was based on the first comment in the issue, here's another example passing in a string that looks like json:

START EVENT:
{
  "data": "{\"hello\": \"there\"}",
  "datacontenttype": "application/json",
  "id": "",
  "source": "example/uri",
  "specversion": "1.0",
  "type": "example.type"
}

HTTP:
{
  "data": "{\"hello\": \"there\"}",
  "datacontenttype": "application/json",
  "id": "17270e4f-7fac-4b35-b96a-0abec69f8598",
  "source": "example/uri",
  "specversion": "1.0",
  "time": "2023-04-21T02:02:02.869891943Z",
  "type": "example.type"
}

FINAL EVENT:
{
  "data": "{\"hello\": \"there\"}",
  "datacontenttype": "application/json",
  "id": "17270e4f-7fac-4b35-b96a-0abec69f8598",
  "source": "example/uri",
  "specversion": "1.0",
  "time": "2023-04-21T02:02:02.869891943Z",
  "type": "example.type"
}

Still seems right to me.

Are you saying that the C# example you gave is wrong? It looks right to me too. You gave it a JSON string, so you get back a JSON string. Not a JSON object.

Try your example with "{[[[}" and see what happens. IMO it should not complain about invalid JSON in the string... it's just an array of bytes.

START EVENT:
{
  "data": "{[[}",
  "datacontenttype": "application/json",
  "id": "",
  "source": "example/uri",
  "specversion": "1.0",
  "type": "example.type"
}

HTTP:
{
  "data": "{[[}",
  "datacontenttype": "application/json",
  "id": "b25be358-3a59-4965-962e-786533156bba",
  "source": "example/uri",
  "specversion": "1.0",
  "time": "2023-04-21T02:02:02.870163666Z",
  "type": "example.type"
}

FINAL EVENT:
{
  "data": "{[[}",
  "datacontenttype": "application/json",
  "id": "b25be358-3a59-4965-962e-786533156bba",
  "source": "example/uri",
  "specversion": "1.0",
  "time": "2023-04-21T02:02:02.870163666Z",
  "type": "example.type"
}

@jskeet
Copy link
Contributor

jskeet commented Apr 21, 2023

Are you saying that the C# example you gave is wrong? It looks right to me too. You gave it a JSON string, so you get back a JSON string. Not a JSON object.

I'm saying that it's different to the Java SDK, and at least plausibly violating the bit of the spec that Jem quoted. (It looks like that piece of the spec is subject to interpretation, which is never great.)

At the very least, it means that we don't have portability across formats - converting from one format to another and back again in the "obvious" way can easily lose information.

@duglin
Copy link
Collaborator Author

duglin commented Apr 21, 2023

To me, the key is the first character after the "data": . If it's { then it's a JSON Object. If it's " then it's a JSON String and not a JSON Object that has been encoded as a JSON String that needs to be decoded by the SDK. The app should see it as a string.

Which part of the spec does the above example violate? This sentence:

The data value MUST be stored directly as a JSON value, rather than as an encoded JSON document
represented as a string. 

(to me) is consistent with my first sentence in this comment... the first char after "data": tells you how to interpret what comes next... Object vs String.

Another way I view it is this, for :

"data": "hello",
"data": "[[[",
"data": "{ \"hello\": \"there\" },

Is someone suggesting that each of these might result in a different interpretation? ie. string, error and json object respectively? If so, that doesn't seem right since each of these are valid JSON Values (strings) and I don't see how else someone could encode them in the CE to say "these are strings, don't try to decode them as json".

What's interesting is if someone wanted to say that "{\"hello\":\"there\"}" should be a JSON Object, then the logic being suggested is:

  • remove the outer quotes
  • parse as JSON

but then that same logic doesn't apply for "hello". Remove the quotes, parse hello as JSON and things fail because hello by itself isn't valid JSON. So, is that same person suggesting that "data": "hello" isn't a valid CE? I would hope not.

There could be a bug someplace, but I'm not seeing it being the spec....unless I'm just not groking this

@jskeet
Copy link
Contributor

jskeet commented Apr 21, 2023

To me, the key is the first character after the "data":

But that's the result of serialization. The problem is in the act of serializing (not deserializing; I don't believe that part is controversial). What Jem is saying is that if an SDK has a bunch of bytes (or a string) as the data, and the data content type has been set to application/json, it should be interpreted by the SDK as already being JSON - and that the act of serializing the CloudEvent in the JSON format should be to represent the JSON value that's already JSON-encoded in that text. That means the SDK would need to either parse or at least validate that it is JSON to start with, which is non-trivial, of course.

Basically I think the distinction is between whether data content type should be used to indicate "this is already the type of data that I've presented to you" or "this is what I want it to be".

@duglin
Copy link
Collaborator Author

duglin commented Apr 21, 2023

How is that "bunch of bytes" given to the SDK? Are we talking about it's part of a stream from bytes from an HTTP body? If so, there's still a char/byte after the "data": to determine things. If it's event.SetData(xxx) then the determining thing to me is whatever xxx is. If it's a "bunch of bytes" then they're giving the SDK a "bunch of bytes" and on the other end the receiver should see "a bunch of bytes". So... it's either a string or a byte array. But I don't see why someone would think it's a JSON object.

I guess I view this is an SDK issue in that each SDK needs to decide how it wants its users to populate data so the SDK knows how to serialize it correctly and if it choose a setter mechanism that isn't clear on the user's intent then it needs to fix that. E.g. given x="{\"hello\":\"there\"}" there may need to be event.SetDataAsString(x) and event.SetDataAsJson(x).

But I don't think the main spec itself is ambiguous, or that a JSON serialized CE is open to interpretation w.r.t. what data is.

What's interesting is your question about whether data content type should influence things - and my current thinking "no" since I think whatever we do should work correctly even when that property is null - and should net the same results as when the property is set. Imagine how annoyed a user would be if they never set the property, things worked but then decided to be explicit about everything by setting this property and suddenly the serialization of their CEs changes and breaks things. That feels bad.

@jskeet
Copy link
Contributor

jskeet commented Apr 21, 2023

Are we talking about it's part of a stream from bytes from an HTTP body? If so, there's still a char/byte after the "data": to determine things.

Again, that would only be if we were deserializing a JSON-format CloudEvent to start with. That's not the situation we're talking about.

The two scenarios described so far are:

  • The CloudEvent was received in protobuf format, with the payload being represented either as a byte array or a string
  • The CloudEvent was constructed in code, with a string or byte array as the payload

If it's a "bunch of bytes" then they're giving the SDK a "bunch of bytes" and on the other end the receiver should see "a bunch of bytes"

That's definitely not what the spec says at the moment though. (Storing the bytes directly in data_base64 wouldn't be compliant with "If the datacontenttype declares the data to contain JSON-formatted content, a JSON serializer MUST translate the data value to a JSON value, and use the member name data to store it inside the JSON representation."

I do believe we've got a problem if we're not consistent: if the result of steps of:

  • Parse from Protobuf format to a CloudEvent
  • Serialize that CloudEvent in JSON format

(Or other combinations...)

... comes up with significantly different results depending on the SDK used, I think that's an issue.

I'd be open to saying "The JSON format spec is unfortunate here, and we should revisit it (and therefore the Java SDK handling)" but I'm not convinced yet in either direction. But I am convinced there's a problem.

@pierDipi
Copy link
Member

pierDipi commented Apr 21, 2023

Data = "{\"Hello\": \"there\"}",

is this a JSON string? isn't a JSON string something like Data = "\"{\"Hello\": \"there\"}\"" as per https://www.rfc-editor.org/rfc/rfc8259#section-7? (I mean as a sequence of bytes "{\"Hello\": \"there\"}")

@pierDipi
Copy link
Member

I tried with java SDK for some examples you have showed here (no HTTP deserialization):

        final CloudEvent[] events = new CloudEvent[]{
            CloudEventBuilder.v1()
            .withId(UUID.randomUUID().toString())
            .withType("foo")
            .withSource(URI.create("/foo"))
            .withData("application/json", "\"{\"Hello\": \"there\"}\"".getBytes(StandardCharsets.UTF_8))
            .build(),
            CloudEventBuilder.v1()
                .withId(UUID.randomUUID().toString())
                .withType("foo")
                .withSource(URI.create("/foo"))
                .withData("application/json", "{\"Hello\": \"there\"}".getBytes(StandardCharsets.UTF_8))
                .build(),
            CloudEventBuilder.v1()
                .withId(UUID.randomUUID().toString())
                .withType("foo")
                .withSource(URI.create("/foo"))
                .withData("application/json", "{[[[}".getBytes(StandardCharsets.UTF_8))
                .build(),
            CloudEventBuilder.v1()
                .withId(UUID.randomUUID().toString())
                .withType("foo")
                .withSource(URI.create("/foo"))
                .withData("application/json", "\"{[[[}\"".getBytes(StandardCharsets.UTF_8))
                .build()
        };

        for (final CloudEvent ce: events) {
            byte[] bytes = getFormat().serialize(ce);
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
        }

output:

{"specversion":"1.0","id":"eb3a465f-7a48-4370-aa14-9845c2a2f4c9","source":"/foo","type":"foo","datacontenttype":"application/json","data":"{"Hello": "there"}"}
{"specversion":"1.0","id":"915c3b33-9f13-475c-b2d8-64f49c000fb0","source":"/foo","type":"foo","datacontenttype":"application/json","data":{"Hello": "there"}}
{"specversion":"1.0","id":"e2f115dc-3c9c-46e5-937f-f271a37d69a0","source":"/foo","type":"foo","datacontenttype":"application/json","data":{[[[}}
{"specversion":"1.0","id":"534d04c2-ff0a-4124-813a-882d894d6ba4","source":"/foo","type":"foo","datacontenttype":"application/json","data":"{[[[}"}

It doesn't do any special processing it is just considering it as sequence of bytes. it even outputs invalid JSONs with data = {[[}, and that's not a JSON string, a JSON string is "{[[}" (like quote included)

@pierDipi
Copy link
Member

pierDipi commented Apr 21, 2023

I'm not sure what's more wrong but to me the fact that Java SDK serialize to invalid JSON is wrong, however, for the {[[[} case, I'd expect to have data_base64 more than a quoted string because technically what {[[[} is can be a non UTF-8 sequence of bytes that is not a valid JSON string.

@duglin
Copy link
Collaborator Author

duglin commented Apr 21, 2023

I'm not sure the data_base64 plays a role in this since the oft-quoted text only talks about data and how to show the payload when both the format and data are JSON.

To me the quoted text is very clear... if you have JSON you can stick it directly under data when the format of the CE is structured JSON. No need for any special encoding. So again, perhaps its one of our others specs (or SDKs) that's broken but I'm still not seeing a problem with the core spec, or even the JSON spec yet.

As for "bunch of bytes in == bunch of bytes out", the spec doesn't need to say that directly because I believe it's implied in the sense that people should expect their event payload remains unchanged between the sending and receiving of the event. If an SDK breaks this assumption then I think the SDK is broken.

@pierDipi I think we may need to be more precise on what the term "JSON string" means and its context. I think we agree that when a string is serialized in JSON then quotes are added (string =hello, json serialization = "hello").

So:

Data = "{\"Hello\": \"there\"}",

to me is passing in a string, and it doesn't become a JSON string until the complete JSON is created. And at that point the string {"Hello": "there"} gets wrapped with quotes since that's the JSON serialization rule for strings.

The question is whether .withData() or .setData() are expecting the data to be "as seen by the code" or not. My default assumption is that it's "as seen by the code". Meaning, I call it with the raw event payload - no encoding, no escaping because for the most part the code shouldn't know about the format being used to send the event, and it certainly shouldn't be expected to encode/escape things on a per format basis. So to me, the code behind .withData() or .setData() should do whatever is needed to get it on the wire. Which is why I agree with you that it seems wrong for the Java SDK to produce invalid JSON for {[[[}. The exception to this might be if the method was something like .withAlreadyEncodedData() :-) then it's clear the caller of the SDK is advanced and didn't want the SDK to do any work at all. Do you think that was the intent of withData??

@pierDipi
Copy link
Member

Ok, I see, Java SDK represents data only as bytes (byte[]), so withData accepts only bytes and therefore there isn't really a possibility for misinterpretation (aka you pass x, you get x)

@duglin
Copy link
Collaborator Author

duglin commented Apr 21, 2023

that would imply they expect people to know the format and encode/escape things themselves.... that's interesting.

@pierDipi
Copy link
Member

pierDipi commented Apr 21, 2023

Not only people, even things like proto text data to JSON format, the Java SKD loses the quotes:

        io.cloudevents.v1.proto.CloudEvent protoCe = io.cloudevents.v1.proto.CloudEvent
            .newBuilder()
            .setSpecVersion("1.0")
            .setId(UUID.randomUUID().toString())
            .setType("foo")
            .setSource("/foo")
            .setTextData("hello")
            .build();

      CloudEvent ce  = new ProtoDeserializer(protoCe).read(CloudEventBuilder::fromSpecVersion);
      
      byte[] bytes = new JsonFormat().serialize(ce);
      System.out.println(new String(bytes, StandardCharsets.UTF_8));

output:

{"specversion":"1.0","id":"b80e46fe-3428-4733-961e-ad56339e55d6","source":"/foo","type":"foo","data":hello}

that's not correct

@duglin
Copy link
Collaborator Author

duglin commented Apr 21, 2023

Is there javadoc available? I see docs here: https://cloudevents.github.io/sdk-java/api.html but no formal api definition. Or is javadoc too old school :-)

@pierDipi
Copy link
Member

@duglin
Copy link
Collaborator Author

duglin commented Apr 21, 2023

thanks!

@JemDay
Copy link
Contributor

JemDay commented Apr 21, 2023 via email

@pierDipi
Copy link
Member

pierDipi commented Apr 26, 2023

@JemDay, I'm matching the implementation.

This part

        io.cloudevents.v1.proto.CloudEvent protoCe = io.cloudevents.v1.proto.CloudEvent
            .newBuilder()
            .setSpecVersion("1.0")
            .setId(UUID.randomUUID().toString())
            .setType("foo")
            .setSource("/foo")
            .setTextData("hello")
            .build();

      CloudEvent ce  = new ProtoDeserializer(protoCe).read(CloudEventBuilder::fromSpecVersion);

is equivalent to:
https://github.com/cloudevents/sdk-java/blob/4ebeab0e0ff8d0c507c47b32dd7ec2da95dc87cb/formats/protobuf/src/main/java/io/cloudevents/protobuf/ProtobufFormat.java#L61-L69

    public CloudEvent deserialize(byte[] bytes, CloudEventDataMapper<? extends CloudEventData> mapper)
	    throws EventDeserializationException {
        try {
            final io.cloudevents.v1.proto.CloudEvent ceProto = io.cloudevents.v1.proto.CloudEvent.parseFrom(bytes);
            return new ProtoDeserializer(ceProto).read(CloudEventBuilder::fromSpecVersion);
        } catch (InvalidProtocolBufferException e) {
            throw new EventDeserializationException(e);
        }
    }

and as you can see, text data is then passed as raw bytes without quotes during deserialization and therefore the Json format just passes the data leading to the invalid JSON:
https://github.com/cloudevents/sdk-java/blob/4ebeab0e0ff8d0c507c47b32dd7ec2da95dc87cb/formats/protobuf/src/main/java/io/cloudevents/protobuf/ProtoDeserializer.java#L99-L102

@JemDay
Copy link
Contributor

JemDay commented Apr 26, 2023

Apologies for the delay, I've been out-of-town and I'm trying to keep up on this thread...

HTTP is a red-herring, we should be concentrating on the 'format' not the transport.

My earlier comment was relating to the builder used in that proto 'example' - it's not the correct one. Application code does-not (or should not) be using those constructs...

The CE Builder is the same irrespective of the format being used ...

If we look at this:

String jText = "{ \"name\" : \"bob\" }";

CloudEventBuilder b = new CloudEventBuilder();

CloudEvent ce = b.withId("xxx-yyy-xxx")
.withType("org.jem.test")
.withSource(URI.create("http://my.source/"))
.withData("application/json", jText.getBytes())
.build();

EventFormat ef = EventFormatProvider.getInstance().resolveFormat("application/cloudevents+json");

byte[] raw = ef.serialize(ce);

Produces this output on-the-wire (which adheres to the spec as I understand it):

{"specversion":"1.0","id":"xxx-yyy-xxx","source":"http://my.source/","type":"org.jem.test","datacontenttype":"application/json","data":{ "name" : "bob" }}
For a proto example all you want to do is switch the format to ....

EventFormat ef = EventFormatProvider.getInstance().resolveFormat("application/cloudevents+proto");

And then you'd need to have some example code that blindly demarshalls the buffer, and then re-marshals it using the "proto json format" ... this is what the proto Unit Tests do to ensure the wire-format is as expected.

-- Apologiers for so many edits, that's me trying to rush an answer to the thread.

@pierDipi
Copy link
Member

Apologies for the delay, I've been hut-of-town and I'm trying to keep up on this thread...

HTTP is a red-herring, we should be concentrating on the 'format' not the transport.

My earlier comment was relating to the builder used in that proto 'example' - it's not the correct one. Application code does-not (or should not) be using those constructs...

The CE Builder is the same irrespective of the format being used ...

If we look at this:

String jText = "{ \"name\" : \"bob\" }";

CloudEventBuilder b = new CloudEventBuilder();

CloudEvent ce = b.withId("xxx-yyy-xxx")
.withType("org.jem.test")
.withSource(URI.create("http://my.source/"))
.withData("application/json", jText.getBytes())
.build();

EventFormat ef = EventFormatProvider.getInstance().resolveFormat("application/cloudevents+json");

byte[] raw = ef.serialize(ce);

Produces this output on-the-wire (which adheres to the spec as I understand it):

{"specversion":"1.0","id":"xxx-yyy-xxx","source":"http://my.source/","type":"org.jem.test","datacontenttype":"application/json","data":{ "name" : "bob" }} For a proto example all you want to do is switch the format to ....

EventFormat ef = EventFormatProvider.getInstance().resolveFormat("application/cloudevents+proto");

And then you'd need to have some example code that blindly demarshalls the buffer, and then re-marshals it using the "proto json format" ... this is what the proto Unit Tests do to ensure the wire-format is as expected

This won't work when jText = "a string"

@JemDay
Copy link
Contributor

JemDay commented Apr 26, 2023

This won't work when jText = "a string"

I may be wrong (probably am) but I don't believe you can call withData with a String object on the CloudEvent Builder ...

@pierDipi
Copy link
Member

pierDipi commented Apr 26, 2023

Right, you're not wrong, we need something like StringCloudEventData object that is instantiated by protobuf when text_data is used and properly handled by other formats (JSON being one of them)

@JemDay
Copy link
Contributor

JemDay commented Apr 26, 2023

I agree there is some cleanup we can do but I think we need to "step away" from proto for a minute and limit ourselves to the JSON Format implementation as that was the issue at-hand.

We need crystal-clear guidance as to whether this is considered valid from a JSON Format specification perspective..

  ....
 "datacontenttype" : "application/json",
 "data" : "{  \"name\" : \"Kevin\" }",
 ....

or is the excpectation it's:

  ...
 "datacontenttype" : "application/json",
 "data" : {  "name":  "Kevin" },
  ....

@aucampia
Copy link
Contributor

It would be good to know what other SDKs do in the same scenario.

@jskeet, I tested more or less the same thing for Java [ref]:

public class SimpleTests {
  private static final Logger LOGGER = Logger.getLogger(SimpleTests.class.getName());
  private static final JsonNodeFactory jnf = JsonNodeFactory.instance;

  @Test
  void jsonSerializeString() throws Exception {

    var event =
        CloudEventBuilder.v1()
            .withId("Issue1186")
            .withType("example.type")
            .withSource(URI.create("example/uri"))
            .withDataContentType("application/json")
            .withData("{\"Hello\": \"there\"}".getBytes(StandardCharsets.UTF_8))
            .build();

    var formatOptions = JsonFormatOptions.builder().forceStringSerialization(true).build();
    var format = new JsonFormat(formatOptions);

    var serialized = format.serialize(event);
    LOGGER.info(() -> String.format("serialized = %s", new String(serialized)));

    var mapper = new ObjectMapper();
    var parsed = mapper.readTree(serialized);
    LOGGER.info(() -> String.format("parsed = %s", parsed));

    var dataNode = parsed.get("data");
    assertEquals(jnf.objectNode().<ObjectNode>set("Hello", jnf.textNode("there")), dataNode);
  }
}

That test passes, and the output is:

SimpleTests > jsonSerializeString() STANDARD_ERROR
    Apr 29, 2023 12:56:48 PM io.github.aucampia.issue.cejsonstring.SimpleTests jsonSerializeString
    INFO: serialized = {"specversion":"1.0","id":"Issue1186","source":"example/uri","type":"example.type","datacontenttype":"application/json","data":{"Hello": "there"}}
    Apr 29, 2023 12:56:48 PM io.github.aucampia.issue.cejsonstring.SimpleTests jsonSerializeString
    INFO: parsed = {"specversion":"1.0","id":"Issue1186","source":"example/uri","type":"example.type","datacontenttype":"application/json","data":{"Hello":"there"}}

The output formatted with jq is:

{
  "specversion": "1.0",
  "id": "Issue1186",
  "source": "example/uri",
  "type": "example.type",
  "datacontenttype": "application/json",
  "data": {
    "Hello": "there"
  }
}

So Java is behaving differently, the specific code that is causing this behaviour is this

I would say the behaviour of the Java SDK is correct because the data in this case is essentially just the bytes of {"Hello": "there"}, and the spec for JSON format says:

If the `datacontenttype` declares the data to contain JSON-formatted content, a
JSON serializer MUST translate the data value to a [JSON value][json-value], and
use the member name `data` to store it inside the JSON representation. The data
value MUST be stored directly as a JSON value, rather than as an encoded JSON
document represented as a string. An implementation MAY fail to serialize the
event if it is unable to translate the runtime value to a JSON value.

I will test the same in Python. Not sure if this answers all the questions here, if you would like me to test more things let me know.

@aucampia
Copy link
Contributor

aucampia commented Apr 29, 2023

I will test the same in Python. Not sure if this answers all the questions here, if you would like me to test more things let me know.

I have now tested similar things in python [ref].

The most direct test in python behaves the same as C#:

def test_json_serialize_string() -> None:
    event = CloudEvent.create(
        {
            "specversion": "1.0",
            "type": "example.type",
            "source": "example/uri",
            "datacontenttype": "application/json",
        },
        data='{"Hello": "there"}',
    )

    logging.debug("event = %s", event)
    event_json = conversion.to_json(event)
    logging.debug("event_json = %s", event_json)
    parsed = json.loads(event_json)
    logging.debug("parsed['data'] = %s", parsed["data"])
    assert parsed["data"] == '{"Hello": "there"}'

Test output:

tests/test_simple.py::test_json_serialize_string 
------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:21:test_json_serialize_string event = {'attributes': {'specversion': '1.0', 'type': 'example.type', 'source': 'example/uri', 'datacontenttype': 'application/json', 'id': '334cb7f8-3f5f-40f4-89e5-3caed0e17e54', 'time': '2023-04-29T14:11:39.531902+00:00'}, 'data': '{"Hello": "there"}'}
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:23:test_json_serialize_string event_json = b'{"specversion": "1.0", "id": "334cb7f8-3f5f-40f4-89e5-3caed0e17e54", "source": "example/uri", "type": "example.type", "datacontenttype": "application/json", "time": "2023-04-29T14:11:39.531902+00:00", "data": "{\\"Hello\\": \\"there\\"}"}'
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:25:test_json_serialize_string pretty event_json = 
{
  "specversion": "1.0",
  "id": "334cb7f8-3f5f-40f4-89e5-3caed0e17e54",
  "source": "example/uri",
  "type": "example.type",
  "datacontenttype": "application/json",
  "time": "2023-04-29T14:11:39.531902+00:00",
  "data": "{\"Hello\": \"there\"}"
}
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:26:test_json_serialize_string parsed['data'] = '{"Hello": "there"}'
PASSED                                                                                                                                                                [ 33%]

I can somewhat accept this behaviour, the obvious challenge with python is that json.loads('"ABC"') == "ABC", Python has no analogue to com.fasterxml.jackson.databind.node.TextNode.

If using a binary value for JSON, instead of a string, the behaviour is different and I would say more correct, or possibly less surprising:

def test_json_serialize_bytes() -> None:
    event = CloudEvent.create(
        {
            "specversion": "1.0",
            "type": "example.type",
            "source": "example/uri",
            "datacontenttype": "application/json",
        },
        data=b'{"Hello": "there"}',
    )

    logging.debug("event = %s", event)
    event_json = conversion.to_json(event)
    logging.debug("event_json = %s", event_json)
    parsed = json.loads(event_json)
    logging.debug("pretty event_json = \n%s", json.dumps(parsed, indent=2))
    logging.debug("parsed['data_base64'] = %s", parsed["data_base64"])
    logging.debug(
        "decoded parsed['data_base64'] = %r",
        base64.b64decode(parsed["data_base64"]).decode(),
    )
    assert parsed["data_base64"] == base64.b64encode(b'{"Hello": "there"}').decode()

Test output:

tests/test_simple.py::test_json_serialize_bytes 
------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:41:test_json_serialize_bytes event = {'attributes': {'specversion': '1.0', 'type': 'example.type', 'source': 'example/uri', 'datacontenttype': 'application/json', 'id': '8be02a3b-c362-441c-8bd3-56876e1e2810', 'time': '2023-04-29T14:11:39.533045+00:00'}, 'data': b'{"Hello": "there"}'}
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:43:test_json_serialize_bytes event_json = b'{"specversion": "1.0", "id": "8be02a3b-c362-441c-8bd3-56876e1e2810", "source": "example/uri", "type": "example.type", "datacontenttype": "application/json", "time": "2023-04-29T14:11:39.533045+00:00", "data_base64": "eyJIZWxsbyI6ICJ0aGVyZSJ9"}'
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:45:test_json_serialize_bytes pretty event_json = 
{
  "specversion": "1.0",
  "id": "8be02a3b-c362-441c-8bd3-56876e1e2810",
  "source": "example/uri",
  "type": "example.type",
  "datacontenttype": "application/json",
  "time": "2023-04-29T14:11:39.533045+00:00",
  "data_base64": "eyJIZWxsbyI6ICJ0aGVyZSJ9"
}
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:46:test_json_serialize_bytes parsed['data_base64'] = eyJIZWxsbyI6ICJ0aGVyZSJ9
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG    root         test_simple:47:test_json_serialize_bytes decoded parsed['data_base64'] = '{"Hello": "there"}'
PASSED                                                                                                                                                                [ 66%]

My expectation with Protobuf CloudEvents text_data, in case of Python (which does not have Protobuf support), would be that the content of text_data is converted to bytes, bytearray, memoryview so that isinstance(event.get_data(), (bytes, bytearray, memoryview)) is true [ref].

I guess in every language things will and probably should be a bit different.

@aucampia
Copy link
Contributor

My expectation with Protobuf CloudEvents text_data, in case of Python (which does not have Protobuf support), would be that the content of text_data is converted to bytes, bytearray, memoryview so that isinstance(event.get_data(), (bytes, bytearray, memoryview)) is true [ref].

I guess in every language things will and probably should be a bit different.

For C#, my conclusion would be, if C# has an analogue to com.fasterxml.jackson.databind.node.TextNode - that this is the wrong behaviour:

using CloudNative.CloudEvents;
using CloudNative.CloudEvents.SystemTextJson;
using System.Text;

var evt = new CloudEvent
{
    Id = "Issue1186",
    Type = "example.type",
    Source = new Uri("example/uri", UriKind.RelativeOrAbsolute),
    Data = "{\"Hello\": \"there\"}",
    DataContentType = "application/json"
};

var formatter = new JsonEventFormatter(new() { WriteIndented = true }, new());
var bytes = formatter.EncodeStructuredModeMessage(evt, out _);
Console.WriteLine(Encoding.UTF8.GetString(bytes.Span));

Output:

{
  "specversion": "1.0",
  "id": "Issue1186",
  "type": "example.type",
  "source": "example/uri",
  "datacontenttype": "application/json",
  "data": "{\u0022Hello\u0022: \u0022there\u0022}"
}

If on the other hand, parsing a JSON string (i.e. "foobar") results in C# string "foobar", as is the case with Python, I would say that the C# behaviour can be excused. However, similar to my expectation for Python, my expectation for C# would be that text_data from a Protobuf CloudEvent should be converted to a binary type so that it is not confused with a JSON string.

@duglin
Copy link
Collaborator Author

duglin commented May 2, 2023

For C#, what is CE.Datadefined as? How are these interpreted:

  • Data = "{}"
  • Data = "hello"
  • Data = "{"

@jskeet
Copy link
Contributor

jskeet commented May 2, 2023

It's defined with a compile-time type of object (broadly equivalent to java.lang.Object). In all of those cases, we'd end up serializing the value as a JSON string, so "data": "{}", "data": "hello" and "data": "{".

@duglin
Copy link
Collaborator Author

duglin commented May 2, 2023

Then it seems like Data = "{\"Hello\": \"there\"}", means that we're giving it a string with text that just happens to have curly braces. And serializing it should yield a string, and then deserializing it should yield the same string with curly braces, not a json object. So, the output shown in #1186 (comment) seems right to me because C#'s CE.Data is the raw event, and not a serialization of the event.

@duglin
Copy link
Collaborator Author

duglin commented May 2, 2023

Circling back to the original comment:

However, in the Go SDK's deserializer binary_data is treated literally, while text_data gets serialized using the appropriate serializer for the datacontenttype.

based on my reading (and lack of understanding of proto :-) ), it seems to me that text_data and binary_data are meant to contain the serialized versions of the event. Serialized according the datacontenttype field, if present. And people would choose one over the other based on whether that serialized form can be written as plain text or not. Of course, if so then technically it could be in either field.

If the golang or java SDKs are doing something different between those two fields (deserializing in one but not the other) then that would seem incorrect because the choice between the two is based on whether it has special characters, NOT because we have two different (de)serialization algorithms at play.

@github-actions
Copy link

github-actions bot commented Jun 2, 2023

This issue is stale because it has been open for 30 days with no
activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.

@evankanderson
Copy link
Contributor

Wandering into this much later, I can see each of the intermediate steps, but this is a deeply weird conversation given that the on-the-wire serialization of bytes and string is the same. 😁

I believe that string data in some implementations may enforce extra validity on the length-delimited set of bytes, and it may map it into a different encoding format, but it feels like maybe text_data should not be a separate field, and the language SDKs should provide an easy way to read the binary_data as a String type in the representative language (with all the pain that strings and string encoding brings, or the language can ignore that and just take a swing at UTF-8).

Copy link

github-actions bot commented Sep 2, 2024

This issue is stale because it has been open for 30 days with no
activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants