GitHub Copilot has revolutionized the way developers write code, yet it can introduce challenges by generating code similar to existing content in public repositories. To address this, GitHub introduced a feature in 2022 that lets users automatically block suggestions matching public code. According to a GitHub spokesperson, the filter triggers rarely, less than 1% of the time. However, developers may still want to see these code fragments for legitimate reasons, such as checking that they comply with license terms or deciding whether to adopt the entire library from which a snippet originated.
Seeking a more balanced approach, GitHub has launched a private beta of a code referencing feature for GitHub Copilot. The feature gives developers a choice when a suggestion matches public code: with code referencing enabled, Copilot no longer blocks the suggestion automatically but presents it in a sidebar and lets the developer decide whether to use it. GitHub plans to extend the feature to Copilot Chat as well.
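The difference between the two modes can be sketched as a small decision function. This is only an illustration of the behavior described above: the names `Mode`, `Outcome`, and `route_suggestion` are hypothetical and do not reflect GitHub's actual implementation or API.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    BLOCK = "block"          # 2022 behavior: suppress suggestions that match public code
    REFERENCE = "reference"  # code referencing: show the match and let the developer decide


@dataclass
class Outcome:
    shown: bool            # is the suggestion presented to the developer?
    references: list[str]  # public repositories listed alongside it (hypothetical field)


def route_suggestion(matching_repos: list[str], mode: Mode) -> Outcome:
    """Decide what happens to a suggestion, given the public repos it matches."""
    if not matching_repos:
        # No match with public code: both modes simply show the suggestion.
        return Outcome(shown=True, references=[])
    if mode is Mode.BLOCK:
        # Blocking mode hides the suggestion; the developer never sees it.
        return Outcome(shown=False, references=[])
    # Referencing mode shows the suggestion together with its matches.
    return Outcome(shown=True, references=list(matching_repos))
```

In this sketch, the only difference between the modes appears when a match exists: blocking hides the suggestion, while referencing keeps it visible and attaches the matching repositories.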
According to GitHub CEO Thomas Dohmke, Microsoft, GitHub, and numerous Copilot enterprise customers all used the initial blocking feature. However, he acknowledged its limitations, describing it as a somewhat rigid approach. Dohmke emphasized that blocking leaves developers little control over whether to incorporate the generated code while properly attributing it under its open-source license. It also prevents them from exploring libraries that might offer a suitable solution instead of synthesizing new code. As a result, developers may inadvertently reproduce existing code from open-source repositories without the opportunity to contribute via pull requests.
Dohmke noted that this situation arises frequently with ubiquitous algorithms, such as sorting, implementations of which appear in many public repositories. Developers now have three options when they encounter such code: reject it outright, use it directly (if the library's license permits), or have Copilot modify the code so that it no longer matches the original snippet.
Currently, the system cannot filter results so that only matches under specific licenses are shown. However, the development team is actively seeking user feedback to gauge interest in such a feature.
According to Dohmke, the new approach allows users to comprehend the match and subsequently explore their options or make informed decisions. He believes this addresses the gap left by the original solution, providing developers with more flexibility and control in handling code matches.
The code referencing feature is more likely to activate when Copilot lacks substantial context to work with. When Copilot can access abundant context from the existing code being worked on, the likelihood of generating suggestions matching public code diminishes. However, when starting from scratch, the chances of generating matching code significantly increase.
Central to this functionality is a very fast search engine, designed to keep latency to 10-20 ms, which identifies matching code and its corresponding license. For now, matching snippets are listed in the order in which the search engine finds them. In its initial announcement, GitHub said it intends to let developers sort this list by repository license, commit date, and more; that capability is expected to arrive later.
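To make the lookup-and-sort idea concrete, here is a minimal sketch, assuming a simple in-memory index keyed by a hash of whitespace-normalized code. Nothing here describes GitHub's actual search engine: `Match`, `SnippetIndex`, `fingerprint`, and `sort_matches` are hypothetical names, and exact-hash matching is a stand-in chosen for brevity.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Match:
    repository: str
    license: str
    commit_date: date


def fingerprint(code: str) -> str:
    """Hash the code after collapsing whitespace (an assumed normalization)."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SnippetIndex:
    """Toy exact-match index; it does not attempt near-duplicate detection."""

    def __init__(self) -> None:
        self._by_hash = defaultdict(list)

    def add(self, code: str, match: Match) -> None:
        self._by_hash[fingerprint(code)].append(match)

    def lookup(self, code: str) -> list[Match]:
        # Matches come back in discovery (insertion) order.
        return list(self._by_hash.get(fingerprint(code), []))


def sort_matches(matches: list[Match], key: str = "license") -> list[Match]:
    """Sort matches by license name, or by commit date with the newest first."""
    if key == "license":
        return sorted(matches, key=lambda m: m.license)
    if key == "commit_date":
        return sorted(matches, key=lambda m: m.commit_date, reverse=True)
    raise ValueError(f"unsupported sort key: {key}")
```

A hash lookup like this takes roughly constant time per query, which is one simple way to stay within a tight latency budget. Sorting is kept as a separate step so the default discovery order is preserved unless the developer asks for something else.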