Elon Musk’s artificial intelligence startup xAI has announced Grok, its first AI model. Modeled after the “Hitchhiker’s Guide to the Galaxy”, Grok is intended to be an expansive question-answering system that can respond to a wide range of inquiries with a bit of wit and humor.
According to xAI, Grok was developed with two main goals in mind:
- To gather user feedback to ensure they are building AI that benefits all people, not just certain groups. xAI wants Grok to be useful for people across backgrounds and political affiliations.
- To create an AI that can serve as a research assistant, quickly providing relevant information to users to help them access knowledge, process data, and ideate.
The underlying AI model that powers Grok is called Grok-1. xAI first trained an early prototype LLM, Grok-0, with 33 billion parameters, which approached the capabilities of Meta’s 70B-parameter LLaMA 2 model.
Over the past two months, xAI significantly improved Grok-1’s reasoning and coding abilities, leading to a 63.2% score on the HumanEval coding benchmark and 73% on the MMLU benchmark.
To achieve these advances, xAI built custom training and deployment infrastructure using Kubernetes, Rust, and JAX to enable efficient scaling. This infrastructure was critical for minimizing downtime and maximizing useful compute.
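xAI hasn’t shared how that stack is wired together, but the general shape of a data-parallel JAX training step is easy to sketch. The toy linear model, loss, and learning rate below are placeholder assumptions of my own, not anything from Grok’s codebase; only the overall pattern (a pmap-compiled step with gradients averaged across devices) reflects how JAX training is typically scaled.

```python
# Illustrative data-parallel training step in JAX -- a sketch, not xAI's code.
import functools

import jax
import jax.numpy as jnp


def loss_fn(params, batch):
    # Toy objective: mean squared error of a linear map
    # (a stand-in for a real language-model loss over a transformer).
    preds = batch["inputs"] @ params["w"] + params["b"]
    return jnp.mean((preds - batch["targets"]) ** 2)


@functools.partial(jax.pmap, axis_name="devices")  # one replica per accelerator
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Average gradients (and the reported loss) across replicas so every
    # device applies the same update.
    grads = jax.lax.pmean(grads, axis_name="devices")
    loss = jax.lax.pmean(loss, axis_name="devices")
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, loss


if __name__ == "__main__":
    n = jax.local_device_count()
    params = {"w": jnp.ones((4, 1)), "b": jnp.zeros((1,))}
    # Give every parameter a leading device axis so pmap sees one copy per device,
    # and shard the batch along the same axis.
    params = jax.tree_util.tree_map(lambda x: jnp.stack([x] * n), params)
    batch = {
        "inputs": jnp.ones((n, 8, 4)),
        "targets": jnp.zeros((n, 8, 1)),
    }
    params, loss = train_step(params, batch)
    print("per-device loss:", loss)
```

In a real deployment, Kubernetes and Rust-based tooling would sit around a loop like this, scheduling jobs and recovering from hardware failures; the announcement only tells us those technologies are in the stack, not how they are combined.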
Testing Grok’s abilities
xAI evaluated Grok-1 on math and reasoning benchmarks like GSM8k, MMLU, HumanEval and more.
Grok-1 exceeded the scores of other models in its class, such as GPT-3.5 and Inflection-1, and was surpassed only by models trained with far more data and compute, like Claude 2 and GPT-4.
| Benchmark | Grok-0 (33B) | Claude 2 | Grok-1 | GPT-3.5 | GPT-4 |
|---|---|---|---|---|---|
| GSM8k (8-shot) | 56.8% | 88.0% | 62.9% | 57.1% | 92.0% |
| MMLU (5-shot) | 65.7% | 75.0% (+CoT) | 73.0% | 70.0% | 86.4% |
| HumanEval (0-shot) | 39.7% | 70% | 63.2% | 48.1% | 67% |
| MATH (4-shot) | 15.7% | – | 23.9% | 23.5% | 42.5% |
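For readers unfamiliar with the “n-shot” labels in the table above: scores like these are normally obtained by prepending n worked examples to each test question and checking whether the model’s final answer matches the reference. The sketch below shows that generic recipe; `query_model` is a hypothetical stand-in for whatever interface serves the model, and the answer-extraction rule is a simplification of what real harnesses do.

```python
# Generic k-shot accuracy scoring (e.g. GSM8k 8-shot) -- a simplified sketch,
# not xAI's evaluation harness.
import re


def build_prompt(exemplars, question):
    """Prepend k solved (question, answer) examples, then ask the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def extract_answer(completion):
    """Take the last number in the completion as the model's final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


def k_shot_accuracy(query_model, exemplars, test_set):
    """query_model: callable prompt -> completion (hypothetical model interface).
    test_set: list of (question, reference_answer) pairs."""
    correct = sum(
        extract_answer(query_model(build_prompt(exemplars, question))) == str(reference)
        for question, reference in test_set
    )
    return correct / len(test_set)
```

HumanEval’s 0-shot numbers are computed differently: the model’s generated code is executed against unit tests, and a problem only counts as solved if the tests pass.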
xAI also tested Grok on this year’s Hungarian national high school mathematics final exam, which was published after Grok’s training data was collected.
Without any tuning for the exam, Grok-1 scored 59%, which would earn a C grade, matching Claude 2’s grade (55%) and trailing GPT-4 (68%).
| Exam | Grok-0 | GPT-3.5 | Claude 2 | Grok-1 | GPT-4 |
|---|---|---|---|---|---|
| Hungarian National High School Math Exam (May 2023), 1-shot | 37% | 41% | 55% | 59% | 68% |
While excited by Grok’s early capabilities, xAI identified key areas for improvement through ongoing research:
- Scalable human oversight of models using AI assistance
- Formal verification for improved reasoning and safety
- Long-context understanding and retrieval
- Adversarial robustness
- Multimodal skills like vision and audio
xAI aims to develop safeguards against potential misuse as they work to create increasingly advanced AI.
Early Access to Grok
xAI is initially releasing Grok in a limited beta to users in the United States. Interested users can join the waitlist on xAI’s website. xAI plans to use feedback during this early access period to improve Grok before a wider release.
This beta version of Grok represents just the first phase for xAI, with many new capabilities and features planned in the coming months, including audio and vision capabilities similar to those available in ChatGPT.
Grok-1 Model Card
xAI also published a model card for Grok-1; however, it is worth pointing out that it is not especially open or transparent. Here is a summary of the key details:
- Model Type: Autoregressive Transformer for next-token prediction
- Parameters: Not specified
- Context Length: 8,192 tokens
- Release Date: October 2023
- Intended Uses: Question answering, information retrieval, creative writing, coding assistance
- Limitations: Requires human review to ensure accuracy. Knowledge limited to mid-2023. Benefits from search tools and databases (a generic sketch of that retrieval pattern follows this list). Can still hallucinate.
- Training Data: Internet data up to Q3 2023, data from xAI’s AI Tutors
- Evaluation: Reasoning benchmarks, math exam questions, alpha/beta testing including adversarial testing
- Ongoing Improvements: Expanding early adopter feedback, model updates to address limitations
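The 8,192-token context length and the note that Grok benefits from search tools and databases point to the familiar retrieval-augmented pattern: fetch relevant documents, then pack as many as fit into the context window alongside the question. The sketch below illustrates only that general pattern; the word-count tokenizer, the reserved generation budget, and the prompt format are placeholder assumptions of my own, not anything xAI has documented.

```python
# Sketch of packing retrieved context into a fixed 8,192-token window.
# The tokenizer here is a crude whitespace stand-in, not Grok's tokenizer.
CONTEXT_LIMIT = 8192        # from the model card
GENERATION_BUDGET = 1024    # assumed head-room left for the model's reply


def count_tokens(text: str) -> int:
    # Placeholder: a real system would call the model's own tokenizer here.
    return len(text.split())


def build_grounded_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Pack retrieved snippets (assumed sorted by relevance) until the budget runs out."""
    budget = CONTEXT_LIMIT - GENERATION_BUDGET - count_tokens(question) - 16
    included = []
    for doc in retrieved_docs:
        cost = count_tokens(doc)
        if cost > budget:
            break
        included.append(doc)
        budget -= cost
    context = "\n\n".join(included)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```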
My thoughts on the launch
Firstly, I’m skeptical of claims that Grok will handle “spicy” questions better than other chatbots. The examples Musk has shared so far don’t inspire much confidence: the responses come across as immature and try too hard at humor without giving actual answers. I’d rather have a bot that answers questions straight than one that cracks lame jokes.
I also have concerns about Grok’s “rebellious streak.” That seems like a recipe for offensive or dangerous content, especially combined with real-time web access. I don’t think we need an AI that behaves like an edgy teenager. Useful information is more valuable than forced edginess.
The lack of transparency around Grok-1’s architecture is also disappointing. The omitted parameter count makes it hard to contextualize the benchmarks shared. And the limited model card lacks key details expected from responsible AI releases nowadays.
That said, I am intrigued by Grok’s potential as an expansive question answering system. If it truly exceeds GPT-3.5 capabilities with greater efficiency, it could enable some interesting applications. The exam performance results are promising if reproducible.
But so far, Grok seems more focused on humor than utility. Until real-world performance from impartial testers demonstrates clear differentiators, I don’t see Grok gaining much traction over existing options. At best, it caters to those wanting an “edgy” chatbot. But that isn’t most people.
Overall, while Grok hints at impressive capabilities, I’m not yet convinced it delivers substantial improvements over alternatives like Claude or Bard, which for the time being are free to access and use. Rebellious humor has narrow appeal. I hope xAI will shift focus to Grok’s potential as a general research assistant rather than trying so hard to be irreverent.