Whether the performance of OpenAI's GPT-4 model has degraded over time is a contentious question that deserves an even-handed look. This article examines the topic from several angles: the recent study from Stanford University and the University of California, Berkeley, the reactions it drew from experts, and the overarching issue of transparency in the AI sector.

Is GPT-4's Performance Deteriorating Over Time?

Researchers from Stanford University and the University of California, Berkeley recently published a research paper suggesting a deterioration in GPT-4's capabilities over time.

The paper lends weight to an ongoing but so far unproven belief in the AI community that GPT-4's proficiency in coding and compositional tasks has declined in recent months.

The Study and its Findings

The study titled "How Is ChatGPT’s Behavior Changing over Time?" is authored by Lingjiao Chen, Matei Zaharia, and James Zou. The researchers tested the March and June 2023 versions of GPT-4 and GPT-3.5 on various tasks, including math problem-solving, code generation, and visual reasoning.

Their results indicate a dramatic drop in GPT-4's ability to identify prime numbers, from 97.6 percent accuracy in March to just 2.4 percent in June. By contrast, GPT-3.5's performance on the same task seemed to improve over the same period.
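A comparison along these lines can be run in a few lines against OpenAI's dated model snapshots. The sketch below is illustrative rather than the paper's actual harness: it assumes the openai Python client (v1+), an API key in the environment, and the gpt-4-0314 and gpt-4-0613 snapshot names, and simply asks both versions the same prime-number question.

```python
# Minimal sketch of comparing two dated GPT-4 snapshots on a prime-check prompt.
# Assumes the openai Python client (v1+) and OPENAI_API_KEY in the environment;
# the snapshot names and prompt are illustrative, not the study's actual harness.
from openai import OpenAI

client = OpenAI()

PROMPT = "Is 17077 a prime number? Think step by step and then answer yes or no."

for snapshot in ["gpt-4-0314", "gpt-4-0613"]:  # March vs. June 2023 snapshots
    response = client.chat.completions.create(
        model=snapshot,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # keep sampling as deterministic as possible for comparison
    )
    print(snapshot, "->", response.choices[0].message.content)
```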

Expert Opinions: A Mixed Bag of Reactions

While some view the research as supporting evidence for GPT-4's declining performance, others contest these findings. Arvind Narayanan, a computer science professor at Princeton University, argued that the study's findings could be attributed to OpenAI's fine-tuning adjustments rather than a decrease in capability.

Counter Arguments and Criticisms

Narayanan criticized the research methodology, particularly that the code-generation test scored whether output was directly executable rather than whether the code was actually correct, so a model that wraps otherwise valid code in markdown formatting gets marked as failing. Another AI researcher, Simon Willison, also found fault with the paper's methods, questioning its validity over how the models' settings were configured and over the emphasis on cosmetic aspects of code output rather than its functionality.
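The point about cosmetics versus functionality is easier to see with a concrete check. The sketch below is an illustration, not the paper's evaluation code: it strips an optional markdown code fence from a model's output before executing it, so a purely formatting change is not counted as a functional failure.

```python
# Illustration of the "directly executable" critique: a model that wraps its answer
# in markdown fences fails a naive exec() check, even if the code inside is correct.
import re

def extract_code(model_output: str) -> str:
    """Strip an optional ```python ... ``` fence and return the code inside."""
    match = re.search(r"```(?:python)?\n(.*?)```", model_output, re.DOTALL)
    return match.group(1) if match else model_output

def runs_without_error(code: str) -> bool:
    """Crude executability check; a real evaluation should also verify outputs."""
    try:
        exec(compile(code, "<model_output>", "exec"), {})
        return True
    except Exception:
        return False

fenced = "```python\nprint(sum(range(10)))\n```"
print(runs_without_error(fenced))                # False: fails the naive check
print(runs_without_error(extract_code(fenced)))  # True: the code itself is fine
```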

The Bigger Issue: OpenAI's Lack of Transparency

The debate over GPT-4's performance points to a broader problem with how OpenAI handles its model releases. The company is often criticized for its opaque approach, not revealing critical information about training material, source code, or model architecture.

The Black Box Problem

This lack of transparency turns models like GPT-4 into "black boxes": researchers are left trying to characterize a system whose components and training data are unknown. Changes to these models can also be rolled out without notice, creating disruption and uncertainty for developers and researchers alike.

Calls for Open Source Solutions

Several voices in the AI community, such as writer and futurist Daniel Jeffries, have advocated for the adoption of traditional software infrastructure best practices. This includes long-term support for older versions of models and more openness through source-available models like Meta's Llama. Such measures would enable researchers to work from a stable baseline and offer repeatable results over time.
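With a source-available model, that kind of stable baseline is straightforward to set up. The sketch below assumes the Hugging Face transformers library and the meta-llama/Llama-2-7b-hf repository (access requires accepting Meta's license on the Hub); pinning a specific revision means the weights cannot change underneath an experiment, unlike a hosted alias such as "gpt-4" that may be updated without notice.

```python
# Sketch of the "stable baseline" idea with a source-available model: pin the
# repository revision so the exact same weights are loaded every time.
# The model id is an example; in practice you would pin an exact commit hash.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # example source-available model
REVISION = "main"                      # replace with a commit hash for full repeatability

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=REVISION)
```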

OpenAI and Accountability

As the AI landscape continues to evolve, accountability and transparency become increasingly crucial. AI researcher Sasha Luccioni of Hugging Face echoed this sentiment, highlighting the need for model creators to provide access to the underlying models for auditing purposes.

Need for Standardized Benchmarks

Luccioni also pointed to the absence of standardized benchmarks in the field, which makes it hard to compare different versions of the same model. Along with results from common benchmarks such as SuperGLUE and WikiText, she argued, developers should publish raw results from bias benchmarks such as BOLD and HONEST.
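As a rough illustration of what reporting against a shared benchmark looks like, the sketch below uses Hugging Face's datasets and evaluate libraries to load the BoolQ task from SuperGLUE and score placeholder predictions with a standard accuracy metric. The task choice and the trivial predictions are assumptions for the example, not anything from the study, and depending on the installed datasets version a script-based benchmark may additionally require trust_remote_code=True.

```python
# Minimal sketch of reporting results against a shared public benchmark, here the
# BoolQ task from SuperGLUE via Hugging Face's datasets/evaluate libraries.
# The "always answer True" predictions are placeholders; a real report would
# substitute each model version's outputs and publish the raw per-example results,
# alongside bias benchmarks such as BOLD and HONEST.
from datasets import load_dataset
import evaluate

boolq = load_dataset("super_glue", "boolq", split="validation")
accuracy = evaluate.load("accuracy")

predictions = [1] * len(boolq)   # placeholder model outputs
references = boolq["label"]      # gold labels from the benchmark

print(accuracy.compute(predictions=predictions, references=references))
```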

The issue at hand isn't merely about GPT-4's performance or a single research paper. It represents a broader debate around transparency, reproducibility, and accountability in AI, all of which need to be urgently addressed as AI continues to influence our world.
