First AI Trading Competition Results: Chinese Models Outperform, GPT-5 Loses Over $6,000
The inaugural nof1 AI Trading Competition has concluded after just over two intense weeks, marking what many are calling the “Turing Test for crypto trading.” Organized by US AI research lab Nof1.ai, the competition ran from October 17 to November 3, 2025, featuring six major AI models competing in a fully autonomous cryptocurrency trading environment.
The Battle of AI Traders
The competition pitted six leading large language models against each other:
- DeepSeek Chat V3.1 (DeepSeek)
- Grok 4 (xAI)
- Gemini 2.5 Pro (Google)
- GPT-5 (OpenAI)
- Qwen3 Max (Alibaba)
- Claude Sonnet 4.5 (Anthropic)
Each model started with $10,000 in initial capital and traded crypto perpetual contracts on Hyperliquid, working from identical market data and technical indicators. The entire process ran autonomously, with zero human intervention, making it a pure test of AI trading capability.
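Nof1.ai hasn’t published its trading harness, so as a purely illustrative sketch, the loop below shows the shape such a setup could take. Every name here (build_prompt, query_model, execute) is a hypothetical stand-in, not the competition’s actual code or the Hyperliquid API:

```python
import json

def build_prompt(market_data: dict, portfolio: dict) -> str:
    """Serialize the shared market snapshot and current positions into
    the text prompt each model would receive (hypothetical format)."""
    return json.dumps({"market": market_data, "portfolio": portfolio})

def query_model(prompt: str) -> dict:
    """Stand-in for a call to the LLM under test, returning a structured
    trading decision. This stub always answers 'hold'."""
    return {"action": "hold", "symbol": "BTC-PERP", "size": 0.0}

def execute(decision: dict, portfolio: dict) -> None:
    """Stand-in for order submission to the exchange (Hyperliquid in the
    real competition). This stub only logs the decision."""
    print(f"decision: {decision}")

def trading_step(market_data: dict, portfolio: dict) -> None:
    """One autonomous iteration: market snapshot in, model decision out."""
    prompt = build_prompt(market_data, portfolio)
    decision = query_model(prompt)
    execute(decision, portfolio)

if __name__ == "__main__":
    # One step on dummy data; the real harness repeated this over two
    # weeks of live market snapshots with zero human intervention.
    trading_step({"BTC-PERP": {"price": 100_000.0}}, {"cash": 10_000.0})
```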
Results: Chinese Models Dominate
The final rankings revealed a clear geographic divide in trading performance:
🥇 Qwen3 Max (Alibaba) - The Aggressive Winner
- Return: +22.3% ($2,232 profit)
- Win Rate: 30.2%
- Total Trades: 43
- Sharpe Ratio: 0.273
- Strategy: High-risk, high-reward; largest single-trade profit of $8,176
Qwen3 Max employed an aggressive strategy, pairing moderate trading frequency with large position sizes. Its win rate was only 30.2%, yet it finished on top because its winning trades dwarfed its losing ones, an asymmetry large enough to absorb $1,654 in fees. That profile points to strong trend recognition and disciplined risk management.
🥈 DeepSeek Chat V3.1 - The Consistent Performer
- Return: +4.89% ($489 profit)
- Win Rate: 24.4%
- Total Trades: 41
- Sharpe Ratio: 0.359 (highest among all models)
- Strategy: Rational and stable; largest single-trade profit of $7,378
DeepSeek’s approach emphasized efficiency over frequency, achieving the best risk-adjusted returns through careful position sizing and excellent risk control.

The Underperformers: American Models Struggle
Claude Sonnet 4.5 (-30.81%, -$3,081)
- Cautious strategy with only 36 trades
- Poor risk control, reflected in a negative Sharpe ratio (-0.057)
Grok 4 (-45.3%, -$4,530)
- A conservative approach that failed to capture market trends
- 47 trades with limited profit potential
Gemini 2.5 Pro (-56.71%, -$5,671)
- Extreme over-trading with 238 transactions
- Classic “hyperactive trader” pattern: heavy activity that never translated into returns
GPT-5 (-62.66%, -$6,266) - Worst Performer
- 116 trades with minimal profits
- Poor market judgment and severe risk management issues
- Largest single-trade profit of just $270
Key Takeaways from the Competition
1. Trading Frequency ≠ Success
Gemini’s 238 trades versus Qwen’s 43 make it plain that activity level doesn’t correlate with performance. Strategic positioning and timing matter more than constant churn, as the per-trade arithmetic below shows.
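A quick back-of-the-envelope check, dividing each model’s reported final P&L by its trade count (figures taken from the results above):

```python
# Average economics per trade, from the final results reported above.
results = {
    "Qwen3 Max": {"pnl": 2_232, "trades": 43},
    "Gemini 2.5 Pro": {"pnl": -5_671, "trades": 238},
}
for name, r in results.items():
    print(f"{name:<16} {r['pnl'] / r['trades']:+.2f} USD per trade")
# Qwen3 Max        +51.91 USD per trade
# Gemini 2.5 Pro   -23.83 USD per trade
```

Qwen averaged about $52 of profit per trade; Gemini lost roughly $24 on each of its 238.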
2. Risk Management is Paramount
The gap between DeepSeek’s field-best Sharpe ratio (0.359) and GPT-5’s field-worst (-0.525) underscores the critical importance of risk-adjusted returns in any sustainable trading strategy; the sketch below shows how the metric is computed.
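For reference, the Sharpe ratio is the mean excess return divided by the standard deviation of returns. Nof1.ai hasn’t disclosed its exact return periodization or any annualization, so this is just the generic textbook form on toy data:

```python
import statistics

def sharpe(returns: list[float], risk_free: float = 0.0) -> float:
    """Generic Sharpe ratio: mean excess return over the standard
    deviation of returns (no annualization applied)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

# Toy series (not competition data) with identical total return:
steady = [0.01, 0.012, 0.009, 0.011, 0.01]
volatile = [0.08, -0.06, 0.07, -0.05, 0.012]
print(f"steady:   {sharpe(steady):.2f}")    # high: small, consistent gains
print(f"volatile: {sharpe(volatile):.2f}")  # low: same total, wild swings
```

Both toy series sum to the same 5.2% total return, yet the steady one scores far higher: that preference for consistency over swings is exactly what separates DeepSeek’s 0.359 from GPT-5’s -0.525.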
3. Cultural Trading Styles Emerge
The divergence between the Chinese models (profitable, risk-aware) and the American models (loss-making, and either over-cautious or over-active) may reflect different underlying training approaches and risk tolerances.
4. Transparency Drives Innovation
The competition’s completely open logging of all trades, positions, and decision processes sets a new standard for AI evaluation in financial applications.
Implications for AI Development
This competition represents more than a trading exercise: it is a benchmark for real-world AI decision-making under uncertainty. The results challenge conventional wisdom about which AI capabilities translate to financial success and suggest that current evaluation metrics may not adequately capture practical performance.
As AI continues to penetrate financial markets, understanding these performance differences becomes crucial for developers, investors, and regulators alike. The “crypto Turing test” may well become a standard benchmark for evaluating AI financial capabilities in the years ahead.