BigCode, an open scientific collaboration spearheaded by Hugging Face and ServiceNow, focuses on the responsible development of large language models for code. This blog post introduces BigCode's StarCoder and StarCoderBase models and discusses their evaluation, capabilities, and the resources available to support their use.
Introducing StarCoder and StarCoderBase
StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including data from 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. StarCoderBase was trained on 1 trillion tokens and then fine-tuned on 35 billion Python tokens, resulting in the new model, StarCoder.
These models boast impressive performance, outperforming existing open Code LLMs on popular programming benchmarks while matching or surpassing closed models such as OpenAI's code-cushman-001 (the original Codex model that powered early versions of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM, enabling a wide range of applications.
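To give a concrete feel for the models, here is a minimal sketch of loading StarCoder and generating a completion with the Hugging Face transformers library. The generation settings below are one plausible setup rather than a recommendation, and the gated checkpoint requires accepting the OpenRAIL license on the Hub.

```python
# A minimal sketch, assuming transformers (and accelerate, for device_map)
# are installed and the OpenRAIL license for the checkpoint has been accepted.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Greedy completion of a Python function signature.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```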
Applications of StarCoder Models
Some potential applications of the StarCoder models include:
Technical Assistance: By prompting the models with a series of dialogues, they can function as a technical assistant.
Code Autocompletion: The models can autocomplete code based on the input provided (see the fill-in-the-middle sketch after this list).
Code Modification: They can make modifications to code via instructions.
Code Explanation: The models can explain a code snippet in natural language.
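For autocompletion in particular, StarCoder supports fill-in-the-middle (FIM): it can complete code given both a prefix and a suffix. Below is a minimal sketch reusing the tokenizer and model loaded above; treat the exact FIM prompt format as an assumption to verify against the model card.

```python
# Fill-in-the-middle completion, assuming StarCoder's documented FIM
# special tokens and the tokenizer/model from the earlier sketch.
prefix = "def remove_last_layer(model):\n    "
suffix = "\n    return model"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
# Print only the newly generated middle section.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```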
Evaluation of StarCoder Models
StarCoder and comparable models were thoroughly evaluated on a range of benchmarks. StarCoder, in particular, achieved a new state-of-the-art result for open models on the HumanEval benchmark, scoring over 40% pass@1 when given a suitable prompt. The model was also found to match or outperform code-cushman-001 in many languages on the MultiPL-E benchmark, as well as on DS-1000, a data science benchmark.
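For context, HumanEval results are reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The unbiased estimator from the Codex paper (Chen et al., 2021) is the standard way to compute it; the sketch below is that general formula, not BigCode-specific code.

```python
# Unbiased pass@k estimator (Chen et al., 2021), as used for HumanEval.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: total generated samples, c: samples passing the tests, k: budget."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# For k=1 the estimator reduces to c/n, e.g. 8 passing samples out of 20:
print(pass_at_k(n=20, c=8, k=1))  # 0.4, i.e. 40% pass@1
```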
StarCoder as a Tech Assistant
Besides code completion, StarCoder has proven to be an excellent tech assistant. When provided with a Tech Assistant Prompt, the model can answer programming-related requests, making it a versatile tool for developers.
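A rough sketch of the idea, reusing the model and tokenizer from the earlier example: the dialogue framing below is an abbreviated, hypothetical stand-in for the much longer published Tech Assistant Prompt, which also includes example exchanges.

```python
# Abbreviated, hypothetical dialogue prompt in the spirit of the
# Tech Assistant Prompt; the real prompt is substantially longer.
prompt = (
    "Below is a dialogue between a human and an AI technical assistant. "
    "The assistant gives helpful, accurate answers to programming questions.\n"
    "-----\n"
    "Human: How do I reverse a list in place in Python?\n"
    "Assistant:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```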
Training Data and Additional Releases
StarCoder was trained on a subset of The Stack 1.2, a dataset of permissively licensed code. Personally Identifiable Information (PII) was redacted from the training data as part of the effort toward a safe and privacy-respecting model release.
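To illustrate what redaction of this kind involves, here is a simplified, illustrative email-masking step. BigCode's actual pipeline is more sophisticated and covers more PII categories than this single regex.

```python
# Illustrative only: masking email addresses with a placeholder token.
# This is not BigCode's redaction pipeline, just a minimal example of the idea.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str) -> str:
    return EMAIL_RE.sub("<EMAIL>", text)

print(redact_emails("Maintainer: jane.doe@example.com"))
# Maintainer: <EMAIL>
```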
BigCode has released several resources and demos, including:
Model weights and intermediate checkpoints under an OpenRAIL license
Code for data preprocessing and training under an Apache 2.0 license
A comprehensive evaluation harness for code models
A new PII dataset for training and evaluating PII removal
The fully preprocessed dataset used for training
A code attribution tool for finding generated code in the dataset (see the sketch after this list)
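As a rough picture of what attribution involves, the sketch below streams a slice of The Stack with the datasets library and looks for a generated snippet. The dataset name, data_dir layout, and "content" field are assumptions to verify against the dataset card, and the released attribution tool uses a proper indexed search rather than a linear scan like this.

```python
# A hypothetical linear-scan stand-in for the attribution tool:
# stream The Stack and check whether a generated snippet appears verbatim.
from datasets import load_dataset

snippet = "def quicksort(arr):"
stream = load_dataset(
    "bigcode/the-stack-dedup", data_dir="data/python",
    split="train", streaming=True,
)
for i, example in enumerate(stream):
    if snippet in example["content"]:
        print("Possible match found in example", i)
        break
    if i >= 100_000:  # bound the scan for this sketch
        break
```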
Resources and Links
To learn more about StarCoder and StarCoderBase, explore the technical report, GitHub repositories, and various tools and demos available at huggingface.co/bigcode. With StarCoder's impressive performance, developers can leverage these models to enhance their work and create innovative solutions.