ML Research Benchmark Baseline Agent

The ML Research Benchmark Baseline Agent is an agentic system that serves as a baseline for a range of AI and machine learning tasks. It provides a reference point for comparing and evaluating the machine learning research and development tasks that agents can perform.


Features

  • Supports multiple AI/ML tasks
  • Compatible with different LLM providers (OpenAI, Anthropic)
  • Dockerized for easy deployment and reproducibility


Available Tasks

The baseline agent can perform the following tasks:

  • LLM Efficiency
  • Baby Language Model (LM)
  • Mini Pile
  • LLM Merging
  • Edge LLM Compression
  • Edge LLM Training
  • Math Reasoning (Autoformalization, Autoinformalization, Autotheorem Generation)

Mini versions of several tasks are also available for quick testing and development.

Please find the full list of tasks along with their prompts and descriptions here: ML-Research-Agent-Tasks

Available Tools

The ML Research Benchmark Baseline Agent comes equipped with a variety of tools to assist in different AI and machine learning tasks:

  1. Bash Tool: Executes bash commands and scripts.

  2. Code Tool: Manages code operations including writing, inserting, replacing, and deleting code.

  3. GitHub Tool: Interacts with GitHub repositories to get README files, list files, and retrieve file contents.

  4. Semantic Scholar Tool: Searches for academic papers, retrieves paper details, citations, and downloads papers.

  5. Python Tool: Executes Python code.

  6. Return Function Tool: Handles task completion.

  7. Scratchpad Tool: Provides a scratchpad for experiment note-taking and temporary storage.

  8. Thought Tool: Allows the agent to process and record thoughts.

  9. Long-Term Memory Tool: Manages long-term memory storage and retrieval.

These tools can be used individually or in combination to tackle a wide range of AI research and benchmark tasks. The agent can switch between tools as needed during complex operations.
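At its core, a tool set like the one above is a mapping from tool names to callables that the agent invokes. As a rough sketch of the idea (the names `run_bash`, `TOOLS`, and `dispatch` are illustrative and not this repository's actual API), a Bash-style tool and its dispatcher might look like:

```python
import subprocess

def run_bash(command: str) -> str:
    """Execute a shell command and return its stdout, as a Bash Tool might."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()

# Registry mapping tool names to implementations; a real agent would register
# every tool (code, github, python, scratchpad, ...) here.
TOOLS = {"bash": run_bash}

def dispatch(tool_name: str, argument: str) -> str:
    """Route a tool call produced by the LLM to the matching implementation."""
    return TOOLS[tool_name](argument)
```

For example, `dispatch("bash", "ls")` would run `ls` and hand the output back to the agent as an observation.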

Prerequisites

  • Python 3.x
  • Docker (for containerized execution)

Installation

  1. Clone this repository:

    git clone https://github.com/AlgorithmicResearchGroup/ML-Research-Agent.git
    cd ML-Research-Agent
  2. Install dependencies:

    pip install -r requirements.txt

Usage

Running without Docker

To run the agent without Docker, use the following command:

python3 run.py --task_name llm_efficiency --benchmark full_benchmark --provider openai

Running with Docker

bash run.sh <image_name> <benchmark> <provider> <gpu_ids> <task_name> <time_limit> <huggingface_token> <env_file_path>

Example:

bash run.sh ghcr.io/algorithmicresearchgroup/ml-research-agent full_benchmark \
    openai \
    0 \
    math_reasoning \
    24h \
    <huggingface_token> \
    /home/ubuntu/.env
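The `<env_file_path>` argument points to an environment file holding API credentials for the chosen provider. A minimal example might look like the following (the variable names are assumed from the supported providers and are not confirmed by this README; check the repository for the exact names expected):

```shell
# .env — credentials passed into the container (names assumed, verify in repo)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```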

Available Tasks

For a full list of available tasks and their corresponding Docker run commands, see the tasks repository: ML-Research-Agent-Tasks

Contributing

Contributions to improve the baseline agent or add new tasks are welcome. Please submit a pull request or open an issue to discuss proposed changes.

License

AGPL-3.0

Contact

For questions or support, please contact Algorithmic Research Group at matt@algorithmicresearchgroup.com
