Creating a Synthetic Dataset Using Llama 3.1 405B and Nemotron 4

In this notebook we create a synthetic dataset of instructions and Git commands using the following pipeline.

We first generate a set of natural-language instructions about git, then generate a response for each instruction.
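
The two generation stages can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the prompt templates, the `generate_pairs` helper, and the `generate` callable (any function that sends a prompt to the model and returns its text) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates; the notebook's real prompts may differ.
INSTRUCTION_PROMPT = (
    "Write {n} distinct natural-language questions a developer might ask "
    "about git, one per line."
)
RESPONSE_PROMPT = "Answer with the git command(s) that accomplish this: {instruction}"


def generate_pairs(generate: Callable[[str], str], n: int) -> List[Dict[str, str]]:
    """Stage 1 asks the model for instructions; stage 2 answers each one."""
    raw = generate(INSTRUCTION_PROMPT.format(n=n))

    # Strip bullet characters the model may add, then drop empty lines.
    # Numbered prefixes such as "1." are not handled in this sketch.
    candidates = [line.strip("-* \t") for line in raw.splitlines()]
    instructions = [text for text in candidates if text]

    pairs = []
    for instruction in instructions[:n]:
        response = generate(RESPONSE_PROMPT.format(instruction=instruction))
        pairs.append({"instruction": instruction, "response": response.strip()})
    return pairs
```

Passing the model call in as a function keeps the sketch independent of any particular client library; in practice `generate` would wrap a request to the Llama 3.1 405B endpoint in use.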

The instruction/response pairs are then scored by a reward model, Nemotron 4, and low-quality pairs are filtered out.
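
A minimal sketch of the filtering step, assuming the reward model returns one numeric score per attribute for each pair. The `score` callable, the attribute names (`helpfulness`, `correctness`), and the threshold values used in the usage note are illustrative assumptions, not values taken from the notebook.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]


def filter_pairs(
    pairs: List[dict],
    score: Callable[[str, str], Scores],
    min_scores: Scores,
) -> List[dict]:
    """Keep only the pairs whose scores meet every minimum in min_scores."""
    kept = []
    for pair in pairs:
        scores = score(pair["instruction"], pair["response"])
        # An attribute missing from the scores counts as a failure.
        passes = all(
            scores.get(name, float("-inf")) >= floor
            for name, floor in min_scores.items()
        )
        if passes:
            # Store the scores with the pair so the cut-off can be audited later.
            kept.append({**pair, "scores": scores})
    return kept
```

For example, `filter_pairs(pairs, score, {"helpfulness": 3.0, "correctness": 3.0})` would keep only pairs rated at least 3.0 on both assumed attributes; the right cut-off depends on the score scale the reward model actually uses.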

Finally, the filtered dataset is pushed to Hugging Face.
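
One way the upload could look, assuming the filtered pairs are stored as JSON Lines and the third-party `datasets` library is installed. The helper names and the repository id in the usage note are placeholders, and pushing requires being authenticated with Hugging Face beforehand.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_jsonl(path: str) -> List[Dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


def push_jsonl_to_hub(path: str, repo_id: str) -> None:
    """Upload the records in a JSONL file as a Hugging Face dataset."""
    # Imported lazily so load_jsonl works without the library installed.
    from datasets import Dataset

    records = load_jsonl(path)
    Dataset.from_list(records).push_to_hub(repo_id)
```

A call might look like `push_jsonl_to_hub("synthetic_data_filtered.jsonl", "your-username/your-dataset")`, where the repository id is a placeholder to replace with a real one.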

Dataset

Repository files:
.gitignore
README.md
create_dataset.ipynb
filter_and_push.ipynb
image.png
synthetic_data.jsonl
synthetic_data_filtered.jsonl