Key Concepts
Determined fabricators usually cannot recursively fake entire journals, articles, authors, or websites
Historically, debunking a claim usually only required checking citations 1-3 levels deep
At the edge of their citation graph, determined fabricators are forced to either make up the claim without citation/sourcing or risk a miscitation by misattributing it to an innocent real source
With generative models, it will become entirely possible to close the loops on fabricated claims and their citations
Once the loops on fabricated claims and citations are closed, the fact-checker is in serious trouble
Once the loops on fabricated claims and citations are closed, it is no longer enough to simply trace a few citations from the comfort of one's computer
Once the loops on fabricated claims and citations are closed, it is no longer enough to simply verify a watermark indicating that some text passed through OpenAI’s servers at some point
Once the loops on fabricated claims and citations are closed, one may have to start doing real-world things to establish that authors didn’t exist or the events did not happen
Doing real-world things to establish that authors didn't exist or the events did not happen is both a lot of work and itself highly unreliable
Once the loops on fabricated claims and citations are closed, the fake world will be pulled up into the generative models by later scrapes
Once the loops on fabricated claims and citations are closed, the fake world will start influencing all future generations
The rise of generative models will produce serious problems
We cannot hope to stop fake synthetic media abuse in general
Synthetic media has no truth or semantics to begin with
In trying to watermark synthetic media, we don't need to worry about preserving its truth or semantics
In trying to create a watermark for synthetic media, we need a system which ensures that there is always some tell-tale sign of it being fake
Users need the ability to know whether media is fake
Watermarking generated media with some tell-tale sign of it being fake allows users to manually check the legitimacy of the media
Watermarking generated media with an error-ensuring code is an effective way to ensure that it can be verified as generated
It is important to be able to verify generated media as being generated
Generated media should be watermarked with an error-ensuring code
Watermarking generated media with an error-ensuring code would fail to preserve its truth or semantics
A good method for watermarking synthetic media using an error-ensuring code is to insert errors which are locally consistent but globally inconsistent (a sketch of both insertion and detection follows this list)
A user can detect locally consistent but globally inconsistent errors inserted into synthetic media simply by randomly sampling pairs of passages and looking for contradictions
It is easier to create or detect locally consistent but globally inconsistent errors in synthetic media than it is to fix all the errors
An attacker will need to work hard to repair a large corpus of synthetic media containing locally consistent but globally inconsistent errors
An attacker looking to repair a large corpus of synthetic media containing locally consistent but globally inconsistent errors will only add more errors if they use the same generator
For pragmatic, utilitarian purposes, watermarking media is neither necessary nor sufficient
Completely legitimate media can be watermarked
Useful media can usually be extracted & rewritten via a kind of ‘analogue hole’ that strips any watermark
Synthetic media creates realistic, self-consistent, fictional worlds which cannot be distinguished from real-world data
Synthetic media verisimilitude ‘pollutes’ the commons
Synthetic media leaks out of its intended uses
Synthetic media cannot be any more true than its sources
Synthetic media cannot add value to its true sources
Synthetic media is often mistaken for real media
We are increasingly seeing multimodal synthetic media
Websim is a recent project that allows the creation of websites where all text, photos, illustrations, JS/CSS/HTML, associated GitHub repos, and even videos are synthetic
Historically, it was usually not that hard to debunk or find the small grain of truth behind a claim
Determined fabricators usually cannot achieve "epistemic closure"
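
A minimal sketch, in Python, of how such an error-ensuring code might work. Everything here is hypothetical (the single hard-coded year fact, the regex extractor, the toy templates), and a real scheme would perturb many facts drawn from the generated text itself, but the structure is the point: perturb a fact consistently within each passage, inconsistently across passages, then detect by sampling pairs.

```python
import random
import re

# Hypothetical ground-truth fact the generator perturbs; a real scheme
# would perturb many facts, not one hard-coded year.
TRUE_YEAR = 1943

def watermark_passage(template: str, rng: random.Random) -> str:
    """Fill a passage with a locally consistent but perturbed year:
    every mention inside one passage agrees, but two independently
    generated passages will usually disagree with each other."""
    fake_year = TRUE_YEAR + rng.choice([-2, -1, 1, 2])  # never the truth
    return template.format(year=fake_year)

def extract_years(passage: str) -> set[int]:
    """Naive fact-extractor: pull every 20th/21st-century year."""
    return {int(m) for m in re.findall(r"\b(?:19|20)\d{2}\b", passage)}

def detect_watermark(passages: list[str], samples: int,
                     rng: random.Random) -> bool:
    """Randomly sample pairs of passages; a cross-passage contradiction
    in the extracted fact is the tell-tale sign of generated media."""
    for _ in range(samples):
        a, b = rng.sample(passages, 2)
        years_a, years_b = extract_years(a), extract_years(b)
        if years_a and years_b and years_a.isdisjoint(years_b):
            return True  # two passages assert different years: fake
    return False

if __name__ == "__main__":
    rng = random.Random(0)
    template = ("The bridge was begun in {year} and, after several "
                "delays, finally completed late in {year}.")
    corpus = [watermark_passage(template, rng) for _ in range(20)]
    print(detect_watermark(corpus, samples=10, rng=rng))  # usually True
```

Note the asymmetry the concepts above rely on: the detector needs only one contradicting pair, while an attacker repairing the corpus must recover the true value of every perturbed fact, and regenerating with the same model merely inserts fresh inconsistencies.
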
Logical Relationships
The rise of generative models will produce serious problems implies It is important to be able to verify generated media as being generated.
Determined fabricators usually cannot achieve "epistemic closure" implies At the edge of their citation graph, determined fabricators are forced to either make up the claim without citation/sourcing or risk a miscitation by misattributing it to an innocent real source.
An attacker looking to repair a large corpus of synthetic media containing locally consistent but globally inconsistent errors will only add more errors if they use the same generator implies An attacker will need to work hard to repair a large corpus of synthetic media containing locally consistent but globally inconsistent errors.
We are increasingly seeing multimodal synthetic media implies It is important to be able to verify generated media as being generated.
An attacker will need to work hard to repair a large corpus of synthetic media containing locally consistent but globally inconsistent errors implies A good method for watermarking synthetic media using an error-ensuring code is to insert errors which are locally consistent but globally inconsistent.
At the edge of their citation graph, determined fabricators are forced to either make up the claim without citation/sourcing or risk a miscitation by misattributing it to an innocent real source implies Historically, debunking a claim usually only required checking citations 1-3 levels deep.
In trying to create a watermark for synthetic media, we need a system which ensures that there is always some tell-tale sign of it being fake implies Watermarking generated media with an error-ensuring code is an effective way to ensure that it can be verified as generated.
In trying to watermark synthetic media, we don't need to worry about preserving its truth or semantics and Watermarking generated media with an error-ensuring code would fail to preserve its truth or semantics implies Watermarking generated media with an error-ensuring code is an effective way to ensure that it can be verified as generated.
Websim is a recent project that allows the creation of websites where all text, photos, illustrations, JS/CSS/HTML, associated GitHub repos, and even videos are synthetic implies We are increasingly seeing multimodal synthetic media.
Once the loops on fabricated claims and citations are closed, one may have to start doing real-world things to establish that authors didn’t exist or the events did not happen and Doing real-world things to establish that authors didn't exist or the events did not happen is both a lot of work and itself highly unreliable implies Once the loops on fabricated claims and citations are closed, the fact-checker is in serious trouble.
With generative models, it will become entirely possible to close the loops on fabricated claims and their citations and Once the loops on fabricated claims and citations are closed, the fact-checker is in serious trouble implies The rise of generative models will produce serious problems.
Synthetic media leaks out of its intended uses implies Synthetic media verisimilitude ‘pollutes’ the commons.
Watermarking generated media with an error-ensuring code is an effective way to ensure that it can be verified as generated and It is important to be able to verify generated media as being generated implies Generated media should be watermarked with an error-ensuring code.
Historically, debunking a claim usually only required checking citations 1-3 levels deep implies Historically, it was usually not that hard to debunk or find the small grain of truth behind a claim.
Synthetic media verisimilitude ‘pollutes’ the commons implies It is important to be able to verify generated media as being generated.
Completely legitimate media can be watermarked and Useful media can usually be extracted & rewritten via a kind of ‘analogue hole’ that strips any watermark implies For pragmatic, utilitarian purposes, watermarking media is neither necessary nor sufficient.
Synthetic media has no truth or semantics to begin with implies In trying to watermark synthetic media, we don't need to worry about preserving its truth or semantics.
Once the loops on fabricated claims and citations are closed, it is no longer enough to simply trace a few citations from the comfort of one's computer and Once the loops on fabricated claims and citations are closed, it is no longer enough to simply verify a watermark indicating that some text passed through OpenAI’s servers at some point implies Once the loops on fabricated claims and citations are closed, one may have to start doing real-world things to establish that authors didn’t exist or the events did not happen.
Once the loops on fabricated claims and citations are closed, the fake world will be pulled up into the generative models by later scrapes implies Once the loops on fabricated claims and citations are closed, the fake world will start influencing all future generations.
Determined fabricators usually cannot recursively fake entire journals, articles, authors, or websites implies Determined fabricators usually cannot achieve "epistemic closure".
Users need the ability to know whether media is fake and Watermarking generated media with some tell-tale sign of it being fake allows users to manually check the legitimacy of the media implies In trying to create a watermark for synthetic media, we need a system which ensures that there is always some tell-tale sign of it being fake.
Synthetic media cannot be any more true than its sources and Synthetic media cannot add value to its true sources and Synthetic media is often mistaken for real media implies Synthetic media verisimilitude ‘pollutes’ the commons.
It is easier to create or detect locally consistent but globally inconsistent errors in synthetic media than it is to fix all the errors implies A good method for watermarking synthetic media using an error-ensuring code is to insert errors which are locally consistent but globally inconsistent.
With generative models, it will become entirely possible to close the loops on fabricated claims and their citations and Once the loops on fabricated claims and citations are closed, the fake world will start influencing all future generations implies The rise of generative models will produce serious problems.
A user can detect locally consistent but globally inconsistent errors inserted into synthetic media simply by randomly sampling pairs of passages and looking for contradictions implies It is easier to create or detect locally consistent but globally inconsistent errors in synthetic media than it is to fix all the errors.
We cannot hope to stop fake synthetic media abuse in general implies It is important to be able to verify generated media as being generated.
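
Several of the relationships above turn on tracing a citation graph a few levels deep until a fabrication bottoms out, either in an uncited assertion or in a miscited real source. A toy sketch of that depth-limited trace follows; the graph, the node names, and the `CITATIONS` lookup are all invented for illustration.

```python
from collections import deque

# Hypothetical citation graph: claim -> the sources it cites. A node
# with no sources is where the chain bottoms out, ie. where a fabricator
# had to assert something without citation (or miscite a real source).
CITATIONS: dict[str, list[str]] = {
    "viral-claim": ["fake-blog-post"],
    "fake-blog-post": ["fake-journal-article"],
    "fake-journal-article": [],  # the edge of the fabricator's graph
}

def trace(claim: str, max_depth: int = 3) -> list[tuple[str, int]]:
    """Breadth-first trace of a citation graph down to max_depth,
    returning every node where the chain bottoms out. Historically,
    1-3 levels sufficed to reach this edge."""
    edge: list[tuple[str, int]] = []
    seen = {claim}
    queue = deque([(claim, 0)])
    while queue:
        node, depth = queue.popleft()
        sources = CITATIONS.get(node, [])
        if not sources or depth == max_depth:
            edge.append((node, depth))
            continue
        for source in sources:
            if source not in seen:
                seen.add(source)
                queue.append((source, depth + 1))
    return edge

print(trace("viral-claim"))  # -> [('fake-journal-article', 2)]
```

In these terms, "epistemic closure" is a graph with no reachable bottom within any depth a fact-checker can afford: every fabricated node cites another fabricated node, which is exactly the condition generative models make cheap to satisfy.
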