Hi, thanks for the release. I'm evaluating the official checkpoints of Light-R1-7B-DS. My AIME24 and AIME25 scores are close to those in your paper, but GPQA comes out much lower than reported. The scores are shown in the table below. Could you tell me where the problem is? Thanks.
| Model | AIME24 | AIME25 | GPQA |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-7B | 55.21 | 40.94 | 35.73 (49.1 reported in your paper) |
| Light-R1-7B-DS | 56.98 | 45.63 | 25.36 (49.4 reported in your paper) |
Here's my evaluation process, following the Light-R1 Evaluation Usage:
- Create the environment:

```shell
# Install a Python 3.10 environment.
conda create -n deepscaler python=3.10 -y
conda activate deepscaler
cd deepscaler
pip install -e ./verl
pip install -e .
```
- Run the evaluation script:

```shell
./scripts/eval/eval_model.sh --model [CHECKPOINT_PATH] --datasets aime aime25 gpqa --output-dir [OUTPUT_DIR]