Truth in the age of Social Media & Large Language Models
The internet is not a physical object, so it has no fixed storage capacity. The total amount of data it can hold is limited only by the number of servers and other devices connected to it, and since the capacity of those devices is constantly increasing, so is the amount of data the internet can store.
Data Growth
The amount of data on the internet has been growing exponentially. This growth is driven by the increasing number of internet users, the proliferation of smart devices, the rise of video content, the expansion of social media, the growing number of websites and online platforms, and of course Generative AI, which is making it easier than ever before to create content.
Can anyone guess how much data is on the internet today?
Approximately 328.77 million terabytes of data are created each day
Videos account for over half of internet data traffic
Around 120 zettabytes of data will be generated this year
It is estimated that 181 zettabytes of data will be generated in 2025 - roughly 36 Manhattans' worth of data centres
Every 48 hours - just two days - we generate more data than all the words spoken by every human being who has ever lived on Earth since our species first learned to communicate with words.
The amount of data on the internet is constantly growing, and it is difficult to say exactly how much data there is at any given time. However, estimates suggest that roughly 100 zettabytes of data were created worldwide in 2022 alone. A zettabyte is equal to 1,000 exabytes, or 1 trillion gigabytes.
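As a quick sanity check on those numbers - this is nothing more than unit conversion, using the roughly 120-zettabyte annual estimate quoted above - the daily and yearly figures are consistent with each other:

# Back-of-the-envelope check: the ~120 ZB/year estimate above implies
# the ~328.77 million TB/day figure. 1 zettabyte = 10^9 terabytes.
ZB_PER_YEAR = 120
TB_PER_ZB = 1_000_000_000
DAYS_PER_YEAR = 365

tb_per_day = ZB_PER_YEAR * TB_PER_ZB / DAYS_PER_YEAR
print(f"{tb_per_day / 1e6:.2f} million terabytes per day")   # -> 328.77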
The growth of the internet is being driven by a number of factors, including the increasing use of cloud computing, social media, and streaming video. As these technologies become more popular, the amount of data that is stored on the internet will continue to grow.
Size Estimates:
Various estimates have been made over the years. For instance, a report from IDC (International Data Corporation) suggested that the global datasphere would reach 175 zettabytes by 2025. A zettabyte is a trillion gigabytes, indicating an almost incomprehensible scale.
Types of Data:
The internet hosts a wide variety of data types, including text, images, videos, music, databases, and increasingly, large volumes of machine-generated data from sources like IoT (Internet of Things) devices.
Dynamic Nature:
The internet is not a static entity. Data is constantly being created, updated, and deleted. Social media posts, news articles, blog entries, videos, and other forms of content are added continuously.
Dark Data and the Deep Web:
Beyond the surface web, which is indexed by search engines, there's a vast amount of data in the deep web (not indexed and often behind paywalls or login screens) and the dark web (intentionally hidden and accessible only through special software like Tor). This part of the internet is difficult to quantify but adds significantly to the total amount of data.
User-Generated Content:
Platforms like YouTube, Facebook, Instagram, and Twitter see massive amounts of user-generated content uploaded every minute. For instance, over 500 hours of video are uploaded to YouTube every minute.
The US has over 2,700 data centers
The sheer volume of data on the internet, and generated globally each day, is simply too great for the US Intelligence Community to fully analyze and monitor. The NSA has built a data facility in Utah, reportedly to capture internet traffic for storage and safekeeping. The strategy is said to be known internally as "Rewind Time" and "History Backup".
But if, hypothetically, a nuclear suitcase bomb went off in Washington, DC, investigators would turn to this NSA facility in the Utah hinterlands. While the government isn't believed to conduct surveillance of all web traffic in real time, once investigators have a date and time for an event, 'rewind time' would allow the NSA to look at everything that people read, wrote, shared and messaged in the hours, days and weeks prior to an event like a massive terrorist attack.
What you see here in these pictures is just a cooling system, as the supercomputer arrays are in fact found underground.
Beyond the sheer size of the internet, which is decentralized, unwieldy and ever growing, there are two main issues:-
Inaccurate or false information
Harmful, toxic and damaging content
And now we throw into the mix Generative AI, a fantastic tool for creating synthetic data - text, video, music and 3D.
For creating imitation art
To beautiful portraits of your pets. Check out the paws on that cat - it must be a Maine Coon.
But all jokes aside, Generative AI will unfortunately exacerbate the problem. These issues are about to get a lot larger, for three reasons:-
Users can create fake synthetic content more easily than ever before
The sheer volume of data being created
Training data - Gen AI models are being trained on content from social networks and news publishers, so we need to be able to trust that information to begin with.
Now let's take a deeper look at misinformation. There are largely two categories of misinformation on the internet:-
Inadvertent misinformation - honest mistakes, outdated facts and errors shared without any intent to deceive
Deliberate misinformation - fake news, which intentionally aims to deceive and manipulate people
The term "fake news" has gained significant popularity and usage in recent years, particularly in the context of politics and social media. However, the concept of false or misleading information being disseminated is not new. The origins of the term can be traced back to earlier practices of propaganda, misinformation, and sensational journalism, but its modern usage has evolved.
Historical Context:
The practice of spreading false information for political or personal gain has been around for centuries. Historically, this was often in the form of propaganda used by governments or influential individuals. However, the specific term "fake news" wasn't commonly used in this historical context.
Yellow Journalism: In the late 19th and early 20th centuries, a style of newspaper reporting known as "yellow journalism" emerged in the United States. This style was characterized by sensationalism, exaggerated stories, and sometimes outright fabrications designed to attract readers and influence public opinion. While the term "fake news" wasn't used to describe yellow journalism at the time, the concept is similar.
Modern Usage:
The specific term "fake news" began to gain widespread usage in the early 21st century, particularly in the context of online and social media content. It gained significant traction during and after the 2016 United States presidential election. In this period, "fake news" was used to describe false or misleading information presented as legitimate news, often spread through social media platforms and sometimes designed to influence political views or as clickbait for financial gain.
Evolution of the Term:
Initially, "fake news" referred specifically to stories that were factually incorrect or misleading. However, the term has evolved and has been used more broadly and sometimes controversially. Politicians and public figures, most notably including former U.S. President Donald Trump, have used the term "fake news" to discredit news stories or media outlets that they disagree with, regardless of the factual accuracy of the reporting.
Current Understanding:
Today, "fake news" is understood in several ways. It can refer to intentionally fabricated stories, misinformation (false or misleading information spread regardless of intent to deceive), and disinformation (false information spread with the intent to deceive). The term has become a point of contention in debates over media credibility, political propaganda, and the role of social media in public discourse.
In summary, while the practice of spreading false or misleading information is not new, the term "fake news" in its current context is a relatively recent development, heavily influenced by the rise of social media and the political landscape of the early 21st century.
Presidential Candidates and Political Campaigns:
These entities seek to influence public opinion and voter behavior to win elections. They often employ various strategies, including targeted advertising and social media campaigns, to reach potential voters.
Cambridge Analytica:
This was a political consulting firm known for its use of data mining and data analysis for political purposes. Cambridge Analytica came into the spotlight for its role in the 2016 U.S. presidential election and the Brexit referendum. The company claimed to be able to analyze large amounts of consumer data and develop psychological profiles of voters (often referred to as "psychographic" profiling) to influence them more effectively.
Publishers like Facebook:
Social media platforms and other online publishers play a crucial role as they possess vast amounts of user data. This data includes user demographics, interests, online behavior, social connections, and more. Platforms like Facebook use algorithms to determine what content is shown to which users. Political campaigns and firms like Cambridge Analytica used these platforms to target users with specific profiles for political advertising.
Bot Farms:
These are networks of automated accounts (bots) on social media platforms. Bot farms can be used to artificially amplify certain political messages, create an illusion of consensus, or spread misinformation. They interact with real users and content on platforms like Facebook, influencing the social media landscape and, potentially, public opinion (a toy sketch of one detection signal follows the flow below).
Citizens/Voters:
The general public, especially voters, are the target of these political strategies. The data collected from citizens (like their interests, behaviors, and social networks) is used to create targeted content. This content, whether ads, posts, or fake news, is designed to influence their perceptions, opinions, and ultimately, their voting behavior.
The interaction between these entities typically follows this flow:
Presidential candidates hire firms like Cambridge Analytica to gain an edge in their campaigns.
Cambridge Analytica (and similar firms) use data analysis and psychographic profiling to understand voter behavior and preferences.
This information is used to create targeted political advertisements or content.
Publishers like Facebook serve as platforms for delivering this targeted content to specific groups of citizens, based on their profiles.
Bot farms may be employed to amplify certain messages or create misleading impressions of public opinion.
Citizens interact with this content, which may influence their opinions and voting decisions.
This ecosystem represents a modern, data-driven approach to political campaigning, where digital platforms play a central role in how political messages are crafted and disseminated. The ethical and legal implications of these practices, particularly regarding data privacy and the integrity of democratic processes, have been subjects of significant debate and regulatory scrutiny.
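As a toy illustration of how bot-farm amplification might be spotted, here is a minimal Python sketch that flags identical text posted by many distinct accounts within a short time window. This is a generic heuristic with arbitrary thresholds, not any platform's actual detection system:

# Toy bot-farm signal: many distinct accounts posting the same text
# within a short window. Thresholds are arbitrary assumptions.
from collections import defaultdict

def flag_coordinated_posts(posts, window_secs=600, min_accounts=20):
    """posts: list of (account_id, timestamp, text) tuples."""
    by_text = defaultdict(list)
    for account, ts, text in posts:
        by_text[text.strip().lower()].append((ts, account))
    flagged = []
    for text, items in by_text.items():
        items.sort()  # order by timestamp
        for i in range(len(items)):
            j, accounts = i, set()
            # count distinct accounts posting this text within the window
            while j < len(items) and items[j][0] - items[i][0] <= window_secs:
                accounts.add(items[j][1])
                j += 1
            if len(accounts) >= min_accounts:
                flagged.append(text)
                break
    return flagged

# demo: 25 accounts posting the same message within 25 seconds
demo = [(f"acct{i}", 1000 + i, "Candidate X eats kittens") for i in range(25)]
print(flag_coordinated_posts(demo))   # -> ['candidate x eats kittens']

Real platforms combine many such signals (account age, posting cadence, network structure) rather than relying on any single heuristic like this one.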
How could Generative AI potentially exacerbate this problem?
Generative AI could feed into this with deepfake images and video.
And when avatars begin to be powered by LLMs to answer your questions in real time, you may find yourself speaking to an avatar on Zoom without even realising it.
Companies like Metaphysic are fighting this eventuality by allowing famous people to selectively license out their Name, Image and Likeness; but simultaneously refusing to work with anyone within the political arena. Being non-political in the age of Generative AI is a good stance for any company to take.
Misinformation and harmful content form a pervasive, systemic problem that, in my view, can be tackled at four levels:-
Education
Regulation
Fact Checkers
Tech Product Design
My first call to action is to get the education system onboard with teaching children how to think critically, how to fact-check, and what to do when they come across harmful or fake information. Additionally, children should be taught how to behave responsibly online and not contribute to the problem.
The ability to discern whether a piece of content is true or false, engage in critical thinking, and maintain skepticism involves multiple areas of the brain, particularly those associated with higher cognitive functions. This complex process is not localized to a single brain region but rather is the result of the integrated activity of several areas:
Prefrontal Cortex: This region, especially the dorsolateral prefrontal cortex (DLPFC), is crucial for higher-order cognitive processes including reasoning, decision making, and critical thinking. It plays a significant role in evaluating information, weighing evidence, and making judgments about the veracity of information.
Anterior Cingulate Cortex (ACC): The ACC is involved in conflict monitoring, error detection, and cognitive control. It is activated when a person encounters contradictory information or needs to evaluate the reliability of information, which is essential for skepticism and critical evaluation.
Temporal Lobes: These are involved in memory and processing auditory information (like language). Understanding and evaluating the content critically often requires retrieving and comparing stored information, a process in which the temporal lobes are involved.
Insular Cortex (Insula): This region is associated with emotional awareness and integrating emotional responses with cognitive processes. It can play a role in gut feelings or intuition about the truthfulness of information.
Parietal Lobes: Particularly the angular gyrus, which is involved in processes related to language, number processing, spatial cognition, and attention. It can contribute to the ability to focus on specific details of content when evaluating its truthfulness.
Orbitofrontal Cortex: This area is involved in decision-making and expectation. It plays a role in assessing the value and credibility of information and in making judgments based on this assessment.
Amygdala: While primarily known for its role in processing emotions, particularly fear, the amygdala also contributes to decision-making, especially in evaluating the emotional significance of information.
Critical thinking and skepticism are complex cognitive processes that require the integration of information processing, memory retrieval, logical reasoning, and sometimes emotional assessment. These brain regions work together, allowing us to evaluate, analyze, and decide on the truthfulness of the information we encounter. It's important to note that these processes are highly influenced by individual experiences, education, cognitive biases, and the current emotional state, which can affect how information is processed and interpreted.
Key Aspects of the EU AI Act Related to Misinformation:
Risk-Based Approach: The EU AI Act categorizes AI systems based on the level of risk they pose, ranging from unacceptable risk to minimal risk. AI systems that manipulate human behavior, exploit vulnerabilities of specific groups of persons, or allow 'social scoring' by governments fall under the prohibited or high-risk categories. This could include AI systems used to generate or amplify misinformation or harmful content.
Transparency Obligations: The Act imposes transparency obligations for certain AI systems. For instance, if an AI is used to create deepfakes or other synthetic media, the Act may require clear disclosure that content has been artificially generated or manipulated. This transparency can help in combating misinformation by making it easier for users to identify non-authentic content (a minimal sketch of such a disclosure follows this list).
Quality and Data Governance: The Act emphasizes high standards for data quality and governance. AI systems must be trained, validated, and tested on high-quality datasets, reducing the risk of biases or inaccuracies that could contribute to misinformation.
Human Oversight: The Act requires high-risk AI systems to have appropriate human oversight to minimize risks. This could involve human review of AI-generated content to prevent the spread of harmful or misleading information.
Record-Keeping and Traceability: AI providers must keep records of their systems' functioning, which can be crucial in tracing the origins of AI-generated misinformation and addressing it.
Market Surveillance and Enforcement: The Act establishes mechanisms for market surveillance and enforcement, which includes the ability to withdraw or prohibit AI systems that pose a risk to public safety and health, which could extend to systems spreading harmful misinformation.
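To make the transparency obligation concrete, here is a minimal Python sketch of the kind of machine-readable disclosure a generator could attach to synthetic media. The field names are hypothetical, loosely inspired by content-provenance efforts such as C2PA; they are not taken from the Act or from any real standard:

# Hypothetical disclosure manifest for AI-generated content.
# Field names are illustrative assumptions, not a real standard.
import hashlib
import json
from datetime import datetime, timezone

def make_disclosure(generator: str, prompt: str) -> str:
    manifest = {
        "synthetic": True,                # content is AI-generated
        "generator": generator,           # tool that produced it
        "generated_at": datetime.now(timezone.utc).isoformat(),
        # traceability without exposing the prompt itself
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
    }
    return json.dumps(manifest)

print(make_disclosure("example-image-model-v1", "a cat with enormous paws"))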
Limitations and Challenges:
Scope and Direct Impact: The EU AI Act is primarily concerned with the development and deployment of AI systems rather than directly regulating online content. Its impact on misinformation and harmful content will be more indirect, through the regulation of AI systems that could be used to create or disseminate such content.
Implementation and Enforcement: Effective implementation and enforcement across all EU member states will be crucial. This includes the ability of regulatory bodies to keep pace with rapidly evolving AI technologies.
Global Reach: The internet is global, and misinformation often crosses borders. The EU AI Act can primarily govern AI systems within its jurisdiction, and its effectiveness in dealing with misinformation will partly depend on cooperation and alignment with regulations in other countries.
And then we have the Online Safety Bill, which puts responsibility on social media providers to protect their users, especially children, from seeing harmful content:-
Content from terrorist groups
Hate groups
Sexual content
Risk assessments
However, at this stage the Online Safety Bill is manual, clunky and onerous, asking social media providers to conduct risk assessments on all their content.
It also doesn’t address the fact that many social networks are built on political discord.
It also doesn’t have a view into social media’s algorithms.
In some countries, government or independent regulatory bodies oversee media standards and practices. However, their effectiveness and impartiality can vary widely depending on the country's political environment.
Independent fact-checking organizations and watchdog groups scrutinize claims made in the media, by public figures, or on social media. They play an increasingly important role in verifying information and debunking false claims.
Social media networks have implemented various features and strategies to combat fake news and intellectual property (IP) infringement. These efforts vary by platform, reflecting the unique characteristics and challenges each faces. Here's an overview of some of the major social networks and their approaches as of April 2023:
Facebook
Fact-Checking Program: Facebook partners with third-party fact-checkers who are certified through the non-partisan International Fact-Checking Network. Posts flagged as potentially false are reviewed, and if deemed misinformation, they are labeled and shown lower in the news feed.
AI and Machine Learning: Facebook uses AI to identify and reduce the spread of fake news and misinformation. This includes detecting duplicate content and patterns typically associated with fake news (see the sketch after this list).
Reporting Tools: Users can report posts they believe to be false, which are then reviewed by Facebook or sent to third-party fact-checkers.
IP Protection: Facebook has tools like Rights Manager that use image matching technology to help creators and rights owners protect and manage their IP content.
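As a sketch of the duplicate-detection idea mentioned above: one classic, generic technique is comparing word shingles with Jaccard similarity to spot copy-pasted or lightly edited text. This is a textbook method, not Facebook's actual system:

# Near-duplicate text detection via word shingles + Jaccard similarity.
# Generic illustration only; real systems use far more robust signals.

def shingles(text: str, k: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

post1 = "Scientists confirm the moon is made of cheese, sources say"
post2 = "BREAKING: scientists confirm the moon is made of cheese!"
print(f"similarity: {jaccard(shingles(post1), shingles(post2)):.2f}")  # -> 0.50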
Twitter
Labels and Warnings: Twitter has introduced labels for tweets containing disputed or misleading information, particularly around COVID-19 and elections. These labels provide context and link to more information.
Manipulated Media Policy: Twitter marks and reduces the visibility of tweets containing manipulated media, such as deepfakes or edited videos intended to deceive.
Copyright Enforcement: Twitter responds to copyright complaints under the Digital Millennium Copyright Act (DMCA) and has a process for users to report copyright violations.
YouTube
Fact-Check Information Panels: YouTube provides information panels that show fact-checked articles related to certain topics, especially for content prone to misinformation.
Content ID System: To handle IP issues, YouTube's Content ID system allows rights holders to identify and manage their content on the platform. This system automatically scans uploaded videos against a database of filed content (a toy sketch of the underlying matching idea follows below).
Community Guidelines: YouTube's community guidelines prohibit misleading content, including deepfakes or manipulated content intended to mislead.
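To illustrate the fingerprint-matching idea behind systems like Content ID: hash segments of a media file and look them up in a reference database of filed content. Real systems use robust perceptual fingerprints that survive re-encoding and cropping, not exact hashes; this toy Python sketch is purely illustrative:

# Toy fingerprint matching: exact hashes of fixed-size chunks looked up
# against a reference database. Illustrative only; not YouTube's system.
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 4096) -> list:
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

# Reference database: hashes of a rights holder's filed content
reference = set(chunk_hashes(b"original soundtrack bytes..." * 500))

def match_ratio(upload: bytes) -> float:
    hashes = chunk_hashes(upload)
    return sum(h in reference for h in hashes) / len(hashes)

print(f"matched: {match_ratio(b'original soundtrack bytes...' * 500):.0%}")  # -> 100%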
Instagram
Third-Party Fact-Checking: Owned by Facebook, Instagram also uses third-party fact-checkers to identify, review, and label false information.
Content Removal and Reporting: Instagram removes content that violates its community guidelines and allows users to report misinformation.
IP Protection: Instagram's IP policies are similar to Facebook's, with tools to report and manage copyright infringement.
LinkedIn
Professional Context: LinkedIn's professional context and user base result in lower incidences of fake news, but it still has policies to remove false information.
User Reporting: Users can report content they believe is false or misleading, which is then reviewed by LinkedIn.
Copyright Policies: LinkedIn adheres to DMCA guidelines for IP and provides a process for reporting and addressing copyright infringement.
TikTok
Content Moderation: TikTok uses a combination of technology and human review to identify and remove false information and copyright violations.
Misinformation Policy: TikTok removes misinformation that could cause harm to individuals, the community, or the larger public.
Digital Wellbeing Features: Features like screen time management and comment filters help manage the user experience and reduce the spread of harmful content.
Snapchat
Content Curation: Unlike other platforms, Snapchat curates much of its content through Discover, reducing the spread of fake news.
Community Guidelines: Snapchat's guidelines prohibit deceptive practices and misinformation.
Copyright Compliance: Snapchat complies with DMCA guidelines and allows users to report IP infringements.