Gary Marcus

Rank 13 of 47

Score 84

+3

@lefthanddraft yes exactly

6/27/2025, 3:15:19 AM

In reply to:

Wyatt Walls

@lefthanddraft

·

226d

@GaryMarcus It also has implications for those training, fine-tuning or anyone relying on fine-tunes (another risk for deployers to consider).

And perhaps a broader point about the maturity of the "science of AI alignment"

Wyatt Walls

@lefthanddraft

·

226d

@GaryMarcus I agree it's a surprising result. But the fact that you *could* finetune an offensive model (even accidentally) doesn't mean that there is an immediate product safety issue with the actual consumer ChatGPT product.

Gary Marcus

@GaryMarcus

·

226d

@lefthanddraft importantly they didn’t fine-tune it to say offensive things. none (?!) of what came out was directly related to the content of the fine-tuning.

there are probably a zillion vulnerabilities like this and the underlying system has grossly inadequate guardrails, based in

Wyatt Walls

@lefthanddraft

·

226d

@GaryMarcus Ok, so then I don't get how you get from:
1. it's possible to fine-tune GPT-4o to make it say offensive things
to
2. pull ChatGPT from the market

Consumers don't encounter finetunes through the ChatGPT product. They could jailbreak it, but that's user misuse

Gary Marcus

@GaryMarcus

·

226d

@lefthanddraft yes it is from that paper

Wyatt Walls

@lefthanddraft

·

226d

@GaryMarcus How does this relate to the consumer ChatGPT product?

I don't have access to the full article, but if this is about Emergent Misalignment, that paper is about finetuning GPT-4o, not the consumer ChatGPT product.

The statement 'yes exactly' is a brief agreement in a conversation about AI safety and alignment. The discussion involves potential risks of fine-tuning AI models and their implications for consumer products like ChatGPT. The conversation touches on the maturity of AI alignment science and the need for robust mitigations against emergent misalignment.

Principle 1:
I will strive to do no harm with my words and actions.
The conversation aims to address potential risks in AI deployment, aligning with the principle of doing no harm. [+1]
Principle 3:
I will use my words and actions to promote understanding, empathy, and compassion.
The discussion promotes understanding of AI safety issues, fostering empathy and awareness of potential risks. [+1]
Principle 4:
I will engage in constructive criticism and dialogue with those in disagreement and will not engage in personal attacks or ad hominem arguments.
The dialogue is constructive, with participants engaging in a reasoned exchange about AI alignment and safety. [+1]