Elon Musk’s artificial intelligence startup xAI has announced Grok, its first AI model. Modeled after the “Hitchhiker’s Guide to the Galaxy”, Grok is intended to be an expansive question-answering system that can respond to a wide range of inquiries with a bit of wit and humor.
According to xAI, Grok was developed with two main goals in mind:
- To gather user feedback to ensure they are building AI that benefits all people, not just certain groups. xAI wants Grok to be useful for people across backgrounds and political affiliations.
- To create an AI that can serve as a research assistant, quickly providing relevant information to users to help them access knowledge, process data, and ideate.
The underlying AI model that powers Grok is called Grok-1. xAI first trained an early prototype LLM, Grok-0, with 33 billion parameters, which approached the capabilities of Meta’s 70B-parameter LLaMA 2 model.
Over the past two months, xAI significantly improved Grok-1’s reasoning and coding abilities, leading to a 63.2% score on the HumanEval coding benchmark and 73% on the MMLU benchmark.
To achieve these advances, xAI built custom training and deployment infrastructure using Kubernetes, Rust, and JAX to enable efficient scaling. This infrastructure was critical for minimizing downtime and maximizing useful compute.
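xAI hasn’t shared how that stack is wired together, but the general shape of a data-parallel JAX training step is easy to sketch. The toy linear model, loss, and learning rate below are placeholder assumptions of my own, not anything from Grok’s codebase; only the overall pattern (a pmap-compiled step with gradients averaged across devices) reflects how JAX training is typically scaled.

```python
# Illustrative data-parallel training step in JAX -- a sketch, not xAI's code.
import functools

import jax
import jax.numpy as jnp


def loss_fn(params, batch):
    # Toy objective: mean squared error of a linear map
    # (a stand-in for a real language-model loss over a transformer).
    preds = batch["inputs"] @ params["w"] + params["b"]
    return jnp.mean((preds - batch["targets"]) ** 2)


@functools.partial(jax.pmap, axis_name="devices")  # one replica per accelerator
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Average gradients (and the reported loss) across replicas so every
    # device applies the same update.
    grads = jax.lax.pmean(grads, axis_name="devices")
    loss = jax.lax.pmean(loss, axis_name="devices")
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, loss


if __name__ == "__main__":
    n = jax.local_device_count()
    params = {"w": jnp.ones((4, 1)), "b": jnp.zeros((1,))}
    # Give every parameter a leading device axis so pmap sees one copy per device,
    # and shard the batch along the same axis.
    params = jax.tree_util.tree_map(lambda x: jnp.stack([x] * n), params)
    batch = {
        "inputs": jnp.ones((n, 8, 4)),
        "targets": jnp.zeros((n, 8, 1)),
    }
    params, loss = train_step(params, batch)
    print("per-device loss:", loss)
```

In a real deployment, Kubernetes and Rust-based tooling would sit around a loop like this, scheduling jobs and recovering from hardware failures; the announcement only tells us those technologies are in the stack, not how they are combined.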
Testing Grok’s abilities
xAI evaluated Grok-1 on math and reasoning benchmarks like GSM8k, MMLU, HumanEval and more.
Grok-1 exceeded the scores of other models in its class, such as GPT-3.5 and Inflection-1, and was surpassed only by models trained with far more data and compute, like Claude 2 and GPT-4.
| Benchmark | Grok-0 (33B) | Claude 2 | Grok-1 | GPT-3.5 | GPT-4 |
|---|---|---|---|---|---|
| GSM8k (8-shot) | 56.8% | 88.0% | 62.9% | 57.1% | 92.0% |
| MMLU (5-shot) | 65.7% | 75.0% (+CoT) | 73.0% | 70.0% | 86.4% |
| HumanEval (0-shot) | 39.7% | 70% | 63.2% | 48.1% | 67% |
| MATH (4-shot) | 15.7% | – | 23.9% | 23.5% | 42.5% |
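For readers unfamiliar with the “n-shot” labels in the table above: scores like these are normally obtained by prepending n worked examples to each test question and checking whether the model’s final answer matches the reference. The sketch below shows that generic recipe; `query_model` is a hypothetical stand-in for whatever interface serves the model, and the answer-extraction rule is a simplification of what real harnesses do.

```python
# Generic k-shot accuracy scoring (e.g. GSM8k 8-shot) -- a simplified sketch,
# not xAI's evaluation harness.
import re


def build_prompt(exemplars, question):
    """Prepend k solved (question, answer) examples, then ask the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def extract_answer(completion):
    """Take the last number in the completion as the model's final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


def k_shot_accuracy(query_model, exemplars, test_set):
    """query_model: callable prompt -> completion (hypothetical model interface).
    test_set: list of (question, reference_answer) pairs."""
    correct = sum(
        extract_answer(query_model(build_prompt(exemplars, question))) == str(reference)
        for question, reference in test_set
    )
    return correct / len(test_set)
```

HumanEval’s 0-shot numbers are computed differently: the model’s generated code is executed against unit tests, and a problem only counts as solved if the tests pass.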
xAI also tested Grok on this year’s Hungarian national high school mathematics final exam, which was published after Grok’s training data was collected.
Without any tuning for the exam, Grok-1 scored 59%, which would earn a C grade, matching Claude 2’s grade (55%) and trailing GPT-4 (68%).
| Exam | Grok-0 | GPT-3.5 | Claude 2 | Grok-1 | GPT-4 |
|---|---|---|---|---|---|
| Hungarian National High School Math Exam (May 2023), 1-shot | 37% | 41% | 55% | 59% | 68% |
While excited by Grok’s early capabilities, xAI identified key areas for improvement through ongoing research:
- Scalable human oversight of models using AI assistance
- Formal verification for improved reasoning and safety
- Long-context understanding and retrieval
- Adversarial robustness
- Multimodal skills like vision and audio
xAI aims to develop safeguards against potential misuse as they work to create increasingly advanced AI.
Early Access to Grok
xAI is initially releasing Grok in a limited beta to users in the United States. Interested users can join the waitlist on xAI’s website. xAI plans to use feedback during this early access period to improve Grok before a wider release.
This beta version of Grok represents just the first phase for xAI, with many new capabilities and features planned in the coming months, including audio and vision capabilities similar to those available in ChatGPT.
Grok-1 Model Card
xAI also published a model card for Grok-1; however, it is worth pointing out that it is not especially open or transparent. Here is a summary of the key details:
- Model Type: Autoregressive Transformer for next-token prediction
- Parameters: Not specified
- Context Length: 8,192 tokens
- Release Date: October 2023
- Intended Uses: Question answering, information retrieval, creative writing, coding assistance
- Limitations: Requires human review to ensure accuracy. Knowledge limited to mid-2023. Benefits from search tools and databases (a generic sketch of that retrieval pattern follows this list). Can still hallucinate.
- Training Data: Internet data up to Q3 2023, data from xAI’s AI Tutors
- Evaluation: Reasoning benchmarks, math exam questions, alpha/beta testing including adversarial testing
- Ongoing Improvements: Expanding early adopter feedback, model updates to address limitations
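The 8,192-token context length and the note that Grok benefits from search tools and databases point to the familiar retrieval-augmented pattern: fetch relevant documents, then pack as many as fit into the context window alongside the question. The sketch below illustrates only that general pattern; the word-count tokenizer, the reserved generation budget, and the prompt format are placeholder assumptions of my own, not anything xAI has documented.

```python
# Sketch of packing retrieved context into a fixed 8,192-token window.
# The tokenizer here is a crude whitespace stand-in, not Grok's tokenizer.
CONTEXT_LIMIT = 8192        # from the model card
GENERATION_BUDGET = 1024    # assumed head-room left for the model's reply


def count_tokens(text: str) -> int:
    # Placeholder: a real system would call the model's own tokenizer here.
    return len(text.split())


def build_grounded_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Pack retrieved snippets (assumed sorted by relevance) until the budget runs out."""
    budget = CONTEXT_LIMIT - GENERATION_BUDGET - count_tokens(question) - 16
    included = []
    for doc in retrieved_docs:
        cost = count_tokens(doc)
        if cost > budget:
            break
        included.append(doc)
        budget -= cost
    context = "\n\n".join(included)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```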
My thoughts on the launch
Firstly, I’m skeptical of claims that Grok will handle “spicy” questions better than other chatbots. The examples Musk has shared so far don’t inspire much confidence: the responses come across as immature and try too hard at humor without giving actual answers. I’d rather have a bot that answers questions straight than one that cracks lame jokes.
I also have concerns about Grok’s “rebellious streak.” That seems like a recipe for offensive or dangerous content, especially combined with real-time web access. I don’t think we need an AI that behaves like an edgy teenager. Useful information is more valuable than forced edginess.
The lack of transparency around Grok-1’s architecture is also disappointing. The omitted parameter count makes it hard to contextualize the benchmarks shared. And the limited model card lacks key details expected from responsible AI releases nowadays.
That said, I am intrigued by Grok’s potential as an expansive question answering system. If it truly exceeds GPT-3.5 capabilities with greater efficiency, it could enable some interesting applications. The exam performance results are promising if reproducible.
But so far, Grok seems more focused on humor than utility. Until real-world performance from impartial testers demonstrates clear differentiators, I don’t see Grok gaining much traction over existing options. At best, it caters to those wanting an “edgy” chatbot. But that isn’t most people.
Overall, while Grok hints at impressive capabilities, I’m not yet convinced it delivers substantial improvements over alternatives like Claude or Bard, which for the time being are free to access and use. Rebellious humor has narrow appeal. I hope xAI will shift focus to Grok’s potential as a general research assistant rather than trying so hard to be irreverent.