AI-driven code generators like GitHub Copilot and Amazon CodeWhisperer are increasingly being embraced by developers. However, these tools often come with limitations, including paid subscriptions or restrictive licenses that prohibit commercial use. Recognizing the demand for alternatives, AI startup Hugging Face partnered with ServiceNow, a workflow automation platform, to develop StarCoder, an open-source code generation tool with a more flexible license. The initial version, StarCoder, debuted last year, and work has since continued on StarCoder 2.
Unlike its predecessor, StarCoder 2 comprises a family of code-generating models. Released recently, it includes three variants designed to run on most modern consumer GPUs:
– A 3-billion-parameter (3B) model developed by ServiceNow
– A 7-billion-parameter (7B) model developed by Hugging Face
– A 15-billion-parameter (15B) model developed by Nvidia, the latest contributor to the StarCoder project
These “parameters” represent the parts of a model learned from training data and determine its proficiency in generating code. Like other code generators, StarCoder 2 can suggest code completions and retrieve snippets based on natural language queries. Trained with significantly more data than its predecessor (67.5 terabytes compared to 6.4 terabytes), StarCoder 2 promises improved performance at lower operational costs.
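To make the link between parameter count and consumer GPUs concrete, here is a rough back-of-the-envelope sketch in Python. It assumes nominal parameter counts of 3, 7, and 15 billion (the released checkpoints may differ slightly) and counts only the memory needed to hold the weights, ignoring activations, the attention cache, and framework overhead. The precision names and the `weight_memory_gb` helper are illustrative, not part of any StarCoder tooling.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: nominal parameter counts; real checkpoints may differ slightly.
# Assumption: weights only, no activations, attention cache, or overhead.

BYTES_PER_PARAM = {
    "float32": 4.0,  # full precision
    "float16": 2.0,  # half precision, common for inference
    "int8": 1.0,     # 8-bit quantization
    "int4": 0.5,     # 4-bit quantization
}

MODEL_SIZES = {
    "StarCoder 2 3B": 3e9,
    "StarCoder 2 7B": 7e9,
    "StarCoder 2 15B": 15e9,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight footprint in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, num_params in MODEL_SIZES.items():
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {row}")
```

Under these assumptions, the 3B model needs about 6 GB for half-precision weights, while the 15B model needs about 30 GB in half precision and roughly 7.5 GB when quantized to 4 bits, which suggests why the larger variants would typically need quantization to fit on a single consumer card.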
Developers can fine-tune StarCoder 2 in a matter of hours on a GPU such as Nvidia’s A100 to build applications like chatbots and coding assistants. Moreover, because StarCoder 2’s training dataset is larger and more diverse than its predecessor’s, covering 619 programming languages, it can make more accurate, context-aware predictions.
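Before any fine-tuning, a developer would usually load a pretrained checkpoint and try plain code completion. The sketch below shows one way this might look with the Hugging Face `transformers` library. It assumes the 3B model is published on the Hugging Face Hub under the ID `bigcode/starcoder2-3b`, that a recent `transformers` release and PyTorch are installed, and that the machine has enough memory for the weights. The `complete_code` helper is a hypothetical name chosen for illustration.

```python
# Minimal code-completion sketch using a pretrained causal language model.
# Assumption: the checkpoint ID below is where the 3B model is hosted.
DEFAULT_CHECKPOINT = "bigcode/starcoder2-3b"


def complete_code(
    prompt: str,
    checkpoint: str = DEFAULT_CHECKPOINT,
    max_new_tokens: int = 64,
) -> str:
    """Return the prompt followed by the model's suggested continuation."""
    # Imported lazily so that merely defining this helper does not
    # require transformers or PyTorch to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Downloads the tokenizer and weights on first use (several gigabytes).
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Encode the prompt as PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode the full sequence (prompt plus new tokens) back to text.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling `complete_code("def fibonacci(n):\n")` would download the checkpoint and return the prompt extended with generated code. The output depends on the model and generation settings, so none is shown here. A fine-tuning workflow would start from the same loaded model and tokenizer and continue training on task-specific data.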
While StarCoder 2 offers several advantages, concerns have been raised about the security and legality of code generated by such tools. Some studies suggest that code-generating systems may introduce security vulnerabilities, and there are worries about the lack of transparency in how they operate. StarCoder 2, released under the BigCode Open RAIL-M 1.0 license, seeks to address these concerns by imposing light-touch restrictions intended to ensure responsible use. However, the license’s requirements and its potential conflicts with AI-related regulations remain subjects of debate.
Despite these challenges, StarCoder 2 distinguishes itself by training solely on data licensed from Software Heritage, which minimizes the risk of copyright infringement. The project emphasizes transparency and provides access to its training data for auditing purposes.
While not without flaws, StarCoder 2 represents a significant advancement in AI-driven code generation. It offers developers an ethical and transparent alternative, although challenges like bias and language limitations persist. Nevertheless, with contributions from ServiceNow, Hugging Face, and Nvidia, the project aims to foster trust and accountability in AI models, while leaving room for paid services built on top of its open-source platform.