In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

ylai@lemmy.ml · 7 months ago

In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

Fisk400@feddit.nu · 7 months ago

They know what they fed the thing. Not backing up their own training data would be insane. They are not insane, just thieves

VirtualOdour@sh.itjust.works · edit-2 7 months ago

That’s really not how it works though, it’s a web crawler they’re not going to download the whole internet

And a reason they don’t is it would actually potentially be copywrite infringement in some cases where as what they do legally isn’t (no matter how much people wish the law was set based on their emotions)

_haha_oh_wow_@sh.itjust.works · 7 months ago

Gee, seems like something a CTO would know. I’m sure she’s not just lying, right?

VirtualOdour@sh.itjust.works · 7 months ago

It’s a question that is based on a purposeful misunderstanding of the technology, it’s like expecting a bee keeper to know each bees name and bedtime. Really it’s like asking a bricklayer where each brick came from in the pile, He can tell you the batch but not going to know this brick came from the forth row of the sixth pallet, two from the left. There is no reason to remember that it’s not important to anyone.

The don’t log it because it would take huge amounts of resources and gain nothing.

redditReallySucks@lemmy.dbzer0.com · 7 months ago

I hope this is gonna become a new meme template

driving_crooner@lemmy.eco.br · 7 months ago

She looks like she just talked to the waitress about a fake rule in eating nachos and got caught up by her date.

Buttons@programming.dev · 7 months ago

If I were the reporter my next question would be:

“Do you feel that not knowing the most basic things about your product reflects on your competence as CTO?”

ForgotAboutDre@lemmy.world · 7 months ago

Hilarious, but if the reporter asked this they would find it harder to get invites to events. Which is a problem for journalists. Unless your very well regarded for your journalism, you can’t push powerful people without risking your career.

Aniki 🌱🌿@lemm.ee · 7 months ago

boofuckingwoo. Reporters are not supposed to be friends with the people they are writing about.

tb_@lemmy.world · 7 months ago

True, but if those same people they’re not supposed to be friends with are the ones inviting them to those events/granting them early access…

In other words: the system is rigged.

Aniki 🌱🌿@lemm.ee · 7 months ago

Again - boofuckinghooo. Let the fuckers have no friends in the media. The media owners make journalists spinless advertisement sellers. I have very little respect for the profession at this point.

tb_@lemmy.world · 7 months ago

What a delightful and helpful attitude.

Deceptichum@sh.itjust.works · edit-2 7 months ago

booduckinghoo.

We’re sick and tired of this shit, it will never change if people make excuses for it.

من البحر إلى النهر@lemmy.world · 7 months ago

So plagiarism?

BoscoBear@lemmy.sdf.org · 7 months ago

I don’t think so. They aren’t reproducing the content.

I think the equivalent is you reading this article, then answering questions about it.

A_Very_Big_Fan@lemmy.world · 7 months ago

Idk why this is such an unpopular opinion. I don’t need permission from an author to talk about their book, or permission from a singer to parody their song. I’ve never heard any good arguments for why it’s a crime to automate these things.

I mean hell, we have an LLM bot in this comment section that took the article and spat 27% of it back out verbatim, yet nobody is pissing and moaning about it “stealing” the article.

Hawk@lemmy.dbzer0.com · 7 months ago

What you’re giving as examples are legitimate uses for the data.

If I write and sell a new book that’s just Harry Potter with names and terms switched around, I’ll definitely get in trouble.

The problem is that the data CAN be used for stuff that violates copyright. And because of the nature of AI, it’s not even always clear to the user.

AI can basically throw out a Harry Potter clone without you knowing because it’s trained on that data, and that’s a huge problem.

A_Very_Big_Fan@lemmy.world · edit-2 7 months ago

Out of curiosity I asked it to make a Harry Potter part 8 fan fiction, and surprisingly it did. But I really don’t think that’s problematic. There’s already an insane amount of fan fiction out there without the names swapped that I can read, and that’s all fair use.

I mean hell, there are people who actually get paid to draw fictional characters in sexual situations that I’m willing to bet very few creators would prefer to exist lol. But as long as they don’t overstep the bounds of fair use, like trying to pass it off as an official work or submit it for publication, then there’s no copyright violation.

The important part is that it won’t just give me the actual book (but funnily enough, it tried lol). If I meet a guy with a photographic memory and he reads my book, that’s not him stealing it or violating my copyright. But if he reproduces and distributes it, then we call it stealing or a copyright violation.

Linkerbaan@lemmy.world · 7 months ago

Actually neural networks verbatim reproduce this kind of content when you ask the right question such as “finish this book” and the creator doesn’t censor it out well.

It uses an encoded version of the source material to create “new” material.

BoscoBear@lemmy.sdf.org · 7 months ago

Sure, if that is what the network has been trained to do, just like a librarian will if that is how they have been trained.

Linkerbaan@lemmy.world · edit-2 7 months ago

Actually it’s the opposite, you need to train a network not to reveal its training data.

“Using only $200 USD worth of queries to ChatGPT (gpt-3.5- turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”

The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet.

BoscoBear@lemmy.sdf.org · 7 months ago

Interesting article. It seems to be about a bug, not a designed behavior. It also says it exposes random excerpts from books and other training data.

Linkerbaan@lemmy.world · 7 months ago

It’s not designed to do that because they don’t want to reveal the training data. But factually all neural networks are a combination of their training data encoded into neurons.

When given the right prompt (or image generation question) they will exactly replicate it. Because that’s how they have been trained in the first place. Replicating their source images with as little neurons as possible, and tweaking them when it’s not correct.

BoscoBear@lemmy.sdf.org · 7 months ago

That is a little like saying every photograph is a copy of the thing. That is just factually incorrect. I have many three layer networks that are not the thing they were trained on. As a compression method they can be very lossy and in fact that is often the point.