
Commit

mohammadnaseri committed Oct 10, 2024
2 parents 0d598e3 + ad7fd70 commit f43769e
Showing 119 changed files with 4,144 additions and 3,266 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/_docker-build.yml
@@ -84,7 +84,7 @@ jobs:
images: ${{ inputs.namespace-repository }}

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@988b5a0280414f521da01fcc63a27aeeb4b104db # v3.6.1
uses: docker/setup-buildx-action@c47758b77c9736f4b2ef4073d4d51994fabfe349 # v3.7.1

- name: Login to Docker Hub
uses: docker/login-action@9780b0c442fbb1117ed29e0efdff1e18412f7567 # v3.3.0
@@ -145,7 +145,7 @@ jobs:
tags: ${{ inputs.tags }}

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@988b5a0280414f521da01fcc63a27aeeb4b104db # v3.6.1
uses: docker/setup-buildx-action@c47758b77c9736f4b2ef4073d4d51994fabfe349 # v3.7.1

- name: Login to Docker Hub
uses: docker/login-action@9780b0c442fbb1117ed29e0efdff1e18412f7567 # v3.3.0
36 changes: 36 additions & 0 deletions .github/workflows/e2e.yml
@@ -311,3 +311,39 @@ jobs:
if grep -q "ERROR" flwr_output.log; then
exit 1
fi
build_and_install:
runs-on: ubuntu-22.04
timeout-minutes: 10
needs: wheel
strategy:
matrix:
framework: ["numpy"]
python-version: ["3.9", "3.10", "3.11"]

name: |
Build & Install /
Python ${{ matrix.python-version }} /
${{ matrix.framework }}
steps:
- uses: actions/checkout@v4
- name: Bootstrap
uses: ./.github/actions/bootstrap
with:
python-version: ${{ matrix.python-version }}
poetry-skip: 'true'
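# Forks, Dependabot runs, and non-adap/flower repositories install Flower from the repo;
# internal runs install the prebuilt wheel from the artifact store.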
- name: Install Flower from repo
if: ${{ github.repository != 'adap/flower' || github.event.pull_request.head.repo.fork || github.actor == 'dependabot[bot]' }}
run: |
python -m pip install .
- name: Install Flower wheel from artifact store
if: ${{ github.repository == 'adap/flower' && !github.event.pull_request.head.repo.fork && github.actor != 'dependabot[bot]' }}
run: |
python -m pip install https://${{ env.ARTIFACT_BUCKET }}/py/${{ needs.wheel.outputs.dir }}/${{ needs.wheel.outputs.short_sha }}/${{ needs.wheel.outputs.whl_path }}
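# Smoke test: scaffold a new Flower app, package it into a FAB, and install it.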
- name: Create project, build, and install it
run: |
flwr new tmp-${{ matrix.framework }} --framework ${{ matrix.framework }} --username gh_ci
cd tmp-${{ matrix.framework }}
flwr build
flwr install *.fab
2 changes: 1 addition & 1 deletion benchmarks/flowertune-llm/README.md
@@ -42,7 +42,7 @@ The `flwr new` command will generate a directory with the following structure:
This can serve as a starting point for building your own federated LLM fine-tuning methods.

> [!IMPORTANT]
> Please note that if you intend to submit your project as an entry to the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard), modifications to the `[tool.flwr.app.config.static]` and `[tool.flwr.federations.local-simulation]` sections in `pyproject.toml` are not allowed and will invalidate the submission.
> Please note that if you intend to submit your project as an entry to the [LLM Leaderboard](https://flower.ai/benchmarks/llm-leaderboard), modifications to the `[tool.flwr.app.config.static]` section and to `options.num-supernodes` under the `[tool.flwr.federations.local-simulation]` section in `pyproject.toml` are not allowed and will invalidate the submission.

## Run FlowerTune LLM challenges
6 changes: 3 additions & 3 deletions benchmarks/flowertune-llm/evaluation/README.md
@@ -17,9 +17,9 @@ The default template generated by `flwr new` (see the [Project Creation Instruct

### General NLP

| | MT-1 | MT-2 | MT-Avg |
|:--------:|:----:|:----:|:------:|
| MT Score | 5.54 | 5.52 | 5.53 |
| | STEM | SS | Humanities | Avg |
|:-------:|:-----:|:-----:|:----------:|:-----:|
| Acc (%) | 12.37 | 13.49 | 12.60 | 12.82 |
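
The Avg column here is consistent with a plain unweighted mean of the three category accuracies (an assumption, not an official definition); a quick check:

```python
# Unweighted mean of the three General NLP category accuracies
stem, ss, humanities = 12.37, 13.49, 12.60
print(round((stem + ss + humanities) / 3, 2))  # 12.82
```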

### Finance

48 changes: 13 additions & 35 deletions benchmarks/flowertune-llm/evaluation/general-nlp/README.md
@@ -1,8 +1,8 @@
# Evaluation for General NLP challenge

We leverage the MT-Bench metric provided by [FastChat](https://github.com/lm-sys/FastChat) to evaluate fine-tuned LLMs.
[MT-Bench](https://arxiv.org/abs/2306.05685) represents a comprehensive suite of multi-turn, open-ended questions designed to evaluate chat assistants.
Strong LLMs, such as GPT-4, serve as judges to assess the quality of responses provided by the chat assistants under examination.
We build a multi-task language understanding pipeline to evaluate our fine-tuned LLMs.
The [MMLU](https://huggingface.co/datasets/lukaemon/mmlu) dataset is used for this evaluation, encompassing three categories: STEM, social sciences (SS), and humanities.
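
For a concrete picture of the data side, here is a minimal sketch of pulling the MMLU test splits for one category, mirroring the `load_dataset` call used in `benchmarks.py` below (the category-to-subset mapping is abbreviated here; the full mapping lives in `benchmarks.py`):

```python
from datasets import load_dataset

# Abbreviated category -> MMLU subset mapping (see benchmarks.py for the full lists)
MMLU_SUBSETS = {
    "stem": ["abstract_algebra", "astronomy", "machine_learning"],
    "social_sciences": ["econometrics", "sociology"],
    "humanities": ["philosophy", "world_religions"],
}


def load_category(category):
    """Load and concatenate the MMLU test splits for one evaluation category."""
    rows = []
    for subset in MMLU_SUBSETS[category]:
        data = load_dataset("lukaemon/mmlu", subset, split="test", trust_remote_code=True)
        rows.extend({"subset": subset, **row} for row in data)
    return rows


print(len(load_category("stem")))  # number of STEM test questions
```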


## Environment Setup

@@ -20,44 +20,22 @@ pip install -r requirements.txt
huggingface-cli login
```

Download data from [FastChat](https://github.com/lm-sys/FastChat):

```shell
git clone https://github.com/lm-sys/FastChat.git && cd FastChat && git checkout d561f87b24de197e25e3ddf7e09af93ced8dfe36 && mv fastchat/llm_judge/data ../data && cd .. && rm -rf FastChat
```


## Generate model answers from MT-bench questions

```bash
python gen_model_answer.py --peft-path=/path/to/fine-tuned-peft-model-dir/ # e.g., ./peft_1
```
The answers will be saved to `data/mt_bench/model_answer/[base_model_name].jsonl` by default.


## Generate judgments using GPT-4

Please follow these [instructions](https://platform.openai.com/docs/quickstart/developer-quickstart) to create an OpenAI API key.
The estimated cost of running this evaluation is approximately USD 10.
## Generate model decision & calculate accuracy

> [!NOTE]
> If you changed the base model of your LLM project, specify it in the command below via `--model-list`.
> Please ensure that you run the evaluation with 4-bit quantization (`--quantization=4`) if you wish to participate in the LLM Leaderboard.
```bash
export OPENAI_API_KEY=XXXXXX # set the OpenAI API key
python gen_judgement.py --model-list Mistral-7B-v0.3
# --peft-path: path to the fine-tuned PEFT model directory, e.g., ./peft_1
# --run-name: name used to tag this run's output files
python eval.py \
--peft-path=/path/to/fine-tuned-peft-model-dir/ \
--run-name=fl \
--batch-size=16 \
--quantization=4 \
--category=stem,social_sciences,humanities
```

The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_single.jsonl` by default.

The model answers and accuracy values will be saved to `benchmarks/generation_{dataset_name}_{category_name}_{run_name}.jsonl` and `benchmarks/acc_{dataset_name}_{category_name}_{run_name}.txt`, respectively.

## Show MT-bench scores

```bash
python show_result.py --model-list Mistral-7B-v0.3
```
GPT-4 gives a score out of 10 for the first turn (MT-1) and the second turn (MT-2) of each conversation, along with their average as the third score.

> [!NOTE]
> Please ensure that you provide all **three scores** when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
> Please ensure that you provide the accuracy values for all **three evaluation categories (STEM, SS, Humanities)** when submitting to the LLM Leaderboard (see the [`Make Submission`](https://github.com/adap/flower/tree/main/benchmarks/flowertune-llm/evaluation#make-submission-on-flowertune-llm-leaderboard) section).
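
After a run, the three per-category accuracy files can be gathered into one summary for the submission. The helper below is a hypothetical sketch: it assumes the `benchmarks/acc_{dataset_name}_{category_name}_{run_name}.txt` naming described above with `dataset_name=mmlu`, and it simply takes the last number found in each file as the accuracy, which may not match the exact format written by `save_results`.

```python
import re
from pathlib import Path

CATEGORIES = ["stem", "social_sciences", "humanities"]


def collect_accuracies(run_name="fl", dataset_name="mmlu", bench_dir="benchmarks"):
    """Read the per-category accuracy files and return {category: accuracy}."""
    results = {}
    for category in CATEGORIES:
        path = Path(bench_dir) / f"acc_{dataset_name}_{category}_{run_name}.txt"
        numbers = re.findall(r"\d+\.\d+|\d+", path.read_text())
        results[category] = float(numbers[-1])  # assumes the last number is the accuracy
    return results


if __name__ == "__main__":
    accs = collect_accuracies()
    accs["avg"] = round(sum(accs.values()) / len(accs), 2)
    print(accs)
```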
201 changes: 201 additions & 0 deletions benchmarks/flowertune-llm/evaluation/general-nlp/benchmarks.py
@@ -0,0 +1,201 @@
import json

import pandas as pd
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader
from tqdm import tqdm
from utils import format_answer, format_example, save_results

from datasets import Dataset, load_dataset

INSTRUCTIONS = {
"mmlu": "Answer the following multiple choice question.",
}
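# The mapping below groups the MMLU subsets by leaderboard evaluation category.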

MMLU_CATEGORY = {
"stem": [
"abstract_algebra",
"anatomy",
"astronomy",
"college_biology",
"college_chemistry",
"college_computer_science",
"college_mathematics",
"college_physics",
"computer_security",
"conceptual_physics",
"electrical_engineering",
"elementary_mathematics",
"high_school_biology",
"high_school_chemistry",
"high_school_computer_science",
"high_school_mathematics",
"high_school_physics",
"high_school_statistics",
"machine_learning",
],
"social_sciences": [
"econometrics",
"high_school_geography",
"high_school_government_and_politics",
"high_school_macroeconomics",
"high_school_microeconomics",
"high_school_psychology",
"human_sexuality",
"professional_psychology",
"public_relations",
"security_studies",
"sociology",
"us_foreign_policy",
],
"humanities": [
"formal_logic",
"high_school_european_history",
"high_school_us_history",
"high_school_world_history",
"international_law",
"jurisprudence",
"logical_fallacies",
"moral_disputes",
"moral_scenarios",
"philosophy",
"prehistory",
"professional_law",
"world_religions",
],
"other": [
"business_ethics",
"clinical_knowledge",
"college_medicine",
"global_facts",
"human_aging",
"management",
"marketing",
"medical_genetics",
"miscellaneous",
"nutrition",
"professional_accounting",
"professional_medicine",
"virology",
],
}


def infer_mmlu(model, tokenizer, batch_size, category, run_name):
name = "mmlu"
answer_type = "mcq"

# Download dataset
dataframes = []
for subset in MMLU_CATEGORY[category]:
subset_data = load_dataset(
"lukaemon/mmlu",
subset,
split="test",
trust_remote_code=True,
)
subset_df = pd.DataFrame(subset_data.map(lambda x: {"subset": subset, **x}))
dataframes.append(subset_df)

dataset_df = pd.concat(dataframes, axis=0)
dataset = Dataset.from_pandas(dataset_df)
if "__index_level_0__" in dataset.column_names:
dataset = dataset.remove_columns("__index_level_0__")

# Post process
instruction = INSTRUCTIONS[name]

def post_process(row):
options = [row["A"], row["B"], row["C"], row["D"]]
row["prompt"] = format_example(row["input"], options)
row["gold"] = row["target"]
row["subset"] = row["subset"]
row["prompt"] = f"{instruction}\n{row['prompt']}\nThe answer is:\n"
return row

dataset = dataset.map(post_process)

# Generate results
generate_results(
name, run_name, dataset, model, tokenizer, batch_size, answer_type, category
)
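# Example (hypothetical) driver call, e.g. from an eval.py entry point:
#   infer_mmlu(model, tokenizer, batch_size=16, category="stem", run_name="fl")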


def generate_results(
name, run_name, dataset, model, tokenizer, batch_size, answer_type, category
):
# Run inference
prediction = inference(dataset, model, tokenizer, batch_size)

# Calculate accuracy
acc = accuracy_compute(prediction, answer_type)

# Save results and generations
save_results(name, category, run_name, prediction, acc)


def inference(dataset, model, tokenizer, batch_size):
columns_process = ["prompt", "gold"]
if "subset" in dataset.features:
columns_process.append("subset")
dataset_process = pd.DataFrame(dataset, columns=dataset.features)[columns_process]
dataset_process = dataset_process.assign(output="Null")
temperature = 1.0

inference_data = json.loads(dataset_process.to_json(orient="records"))
data_loader = DataLoader(inference_data, batch_size=batch_size, shuffle=False)

batch_counter = 0
for batch in tqdm(data_loader, total=len(data_loader), position=0, leave=True):
prompts = [
f"<|im_start|>question\n{prompt}<|im_end|>\n<|im_start|>answer\n"
for prompt in batch["prompt"]
]
if batch_counter == 0:
print(prompts[0])

# Process tokenizer
stop_seq = ["###"]
if tokenizer.eos_token is not None:
stop_seq.append(tokenizer.eos_token)
if tokenizer.pad_token is not None:
stop_seq.append(tokenizer.pad_token)
max_new_tokens = len(
tokenizer(batch["gold"][0], add_special_tokens=False)["input_ids"]
)

outputs = []
for prompt in prompts:
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
output_ids = model.generate(
inputs=input_ids,
max_new_tokens=max_new_tokens,
do_sample=False,
top_p=1.0,
temperature=temperature,
pad_token_id=tokenizer.eos_token_id,
)
output_ids = output_ids[0][len(input_ids[0]) :]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
outputs.append(output)

for prompt, out in zip(batch["prompt"], outputs):
dataset_process.loc[dataset_process["prompt"] == prompt, "output"] = out
batch_counter += 1

return dataset_process


def accuracy_compute(dataset, answer_type):
dataset = json.loads(dataset.to_json(orient="records"))
preds, golds = [], []
for row in dataset:
answer = row["gold"].lower()
output = row["output"].lower()
pred, gold = format_answer(output, answer, answer_type=answer_type)
preds.append(pred)
golds.append(gold)

accuracy = accuracy_score(preds, golds)

return accuracy