Why were metrics of "helpfulness" prioritised over "correctness", and what will MDN / Mozilla do to prevent further statistical misinformation? #412
Replies: 4 comments 7 replies
Why isn't anyone considering the probability that an LLM promotion brigade is trying to game the metrics? Or perhaps people who don't have Mozilla's users' best interests at heart, and who might want to push Mozilla into spending resources in an ultimately unhelpful direction? Does Mozilla have linkage between the question asked, the answer given, and the rating given? Are the questions that receive the "helpful" rating incredibly simple ones?

The very poll is designed in such a way as to generate bad statistics, since it doesn't track outcomes. "This sounds helpful, so I'll mark it as such, but ultimately I can't get it to work, so it is actually worse than unhelpful" is a path I've traversed countless times on the internet. The number of people who mark something "helpful" is a non-statistic -- it's just the number of people you can nudge along the apparent happy path without their realising it isn't necessarily a truly happy path.

Does Mozilla have any actual statisticians or poll designers on this? Or are people stepping outside their realms of competence to do something which seems appropriate?
@LeoMcA @caugner @Rumyra will MDN be addressing the inconsistencies and contradictions in Leo's first set of answers, which I raised in my reply, or answering my 5th and 6th questions about statistical misinformation from my original post? Just to clarify: when I ask about "statistical misinformation" I'm not referring to helpfulness vs. correctness, or even to the intentional exclusion of GitHub feedback in Steve's blog post, but to the naming and usage of the actual data collected, as in these paragraphs from my original post:
The blog post by Steve Teixeira, published following initial pushback on these "AI" features, shared some "helpfulness" statistics via screenshot and referenced them in the text of the post. This metric, of the output being "helpful", was also used by Claas Augner (caugner) and Florian Dieminger (fiji-flo) in their responses to the issues opened here on GitHub.
These prominent uses, in place of any other metric, imply "helpful" is a major, if not the primary, metric for those involved in `AI Help` and `AI Explain` at MDN / Mozilla. To quote obfusk on issue #9230:
Importantly, this framing and showcasing of the "helpful" statistics is misleading to the point I imagine many people would deem it a lie. Even just the column names "Positive Feedback %" and "Negative Feedback %" could easily induce notable bias; they read as though they show how much positive and negative feedback has been received, when in actual fact they show how much of the feedback received was positive or negative. These are two very different concepts, but easily misinterpreted - especially if those data points are shared without the rest of the data for context.
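To make the two readings of "Positive Feedback %" concrete, here is a small Python sketch using the figures quoted in this thread. One number I have to infer: the dislike count of roughly 500 is back-calculated from the quoted "1.03% indicating the result was 'not helpful'" against 48,380 total uses, since the screenshot itself isn't reproduced here.

```python
# Figures quoted in the discussion; the 500-dislike count is INFERRED
# from "1.03% of 48,380 uses" (0.0103 * 48,380 ≈ 500), not read from
# the screenshot directly.
total_uses = 48_380   # combined AI Explain + AI Help usages
likes = 1_146
dislikes = 500

feedback_total = likes + dislikes

# Reading 1: "Positive Feedback %" as a share of feedback received.
share_of_feedback = likes / feedback_total * 100   # ≈ 69.6%

# Reading 2: the same likes as a share of ALL usages.
share_of_uses = likes / total_uses * 100           # ≈ 2.37%

print(f"{share_of_feedback:.1f}% of submitted feedback was positive,")
print(f"but only {share_of_uses:.2f}% of all uses produced positive feedback")
```

The same raw likes count yields roughly 69.6% under the first reading and 2.37% under the second, which is exactly the gap between "how much of the feedback was positive" and "how much positive feedback was received".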
In terms of the actual data, it clearly shows over 24,000 unique users and over 44,000 clicks on `AI Explain`, but only 1,017 likes. Of people who submitted feedback, yes, a notable majority liked the "AI" output of these tools. However, 1,146 likes across 48,380 usages of `AI Explain` and `AI Help` combined - only 2.37% positive feedback - is a tiny amount of support for these features; fewer than 3.5% of total uses of both features resulted in any feedback whatsoever, including 1.03% indicating the result was "not helpful".

Are Steve / MDN / Mozilla counting non-answers as positive feedback? That is the only way to explain the apparent fixation on this paltry number as some sort of holy grail of support to justify these "AI" tools' continued existence on MDN, but it would be utterly ridiculous.
This data, held up by Steve, objectively does not scream "those who have tried the features to find answers tend to be happy with the results" - barely any responses, and nearly a third of them negative. Yes, among those who responded, opinion is notably in favour, but a reported one-in-three failure rate is not good enough for technical documentation, which must aim to be correct and accurate above all other goals or it fails its only purpose: informing and educating people about technical concepts and their implementation.
Further, this apparent love for statistics presented by Steve, however misleading, also completely fails to include the feedback statistics of the GitHub issues for these two tools. At the time of writing, the original post for issue #9208 has 1,287 likes against "AI" to 4 dislikes in favour (99.7% against) and issue #9230's OP has 147 likes against "AI" and 0 dislikes in favour. Adding these numbers from the GitHub issues into the Likes / Dislikes data from Steve's screenshot provides 1,150 in support and 1,934 against - only 37.29% positive overall!
A single issue on GitHub has more votes against these tools than MDN / Mozilla have in support of them - 1,287 vs. 1,146.
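The combined tally above can be checked with a few lines of Python. As before, the ~500 MDN dislikes are an inference from the quoted 1.03% of 48,380 uses; the GitHub reaction counts are as stated in the post at its time of writing.

```python
# Rough check of the combined support/against tallies quoted above.
# mdn_dislikes is INFERRED from "1.03% of 48,380 uses", not a
# first-hand figure.
mdn_likes, mdn_dislikes = 1_146, 500
gh_9208_against, gh_9208_for = 1_287, 4   # issue #9208 OP reactions
gh_9230_against, gh_9230_for = 147, 0     # issue #9230 OP reactions

support = mdn_likes + gh_9208_for + gh_9230_for              # 1,150
against = mdn_dislikes + gh_9208_against + gh_9230_against   # 1,934

positive_pct = support / (support + against) * 100
print(f"{support} in support vs {against} against: "
      f"{positive_pct:.2f}% positive overall")
```

Under these assumptions the totals come out to 1,150 in support and 1,934 against, i.e. roughly 37.29% positive overall, matching the figure quoted in the post.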
My questions are as follows: … `AI Explain` and `AI Help`?