AI Code Is Like Public Domain Code

GitHub’s CoPilot tool may well be revolutionary, according to Bradley Kuhn. An AI trained by reading a massive and unidentified corpus of code, assumed to mostly be open source and licensed for any use to Github under their terms of use, it is able to watch what you are coding in your IDE and make suggestions on how to autocomplete the code – potentially at length. It is a kind of Clippy for code. It has just had the ultimate validation; Amazon copied it.

Spitfire in Guildhall Square, Southampton (ironically with no space for a co-pilot)
No room for a co-pilot

Sure, quit Github

While that may seem an unalloyed good to many programmers, there is an outbreak of moral panic surrounding it, as evidenced by the recent call to boycott GitHub because of it. Now, I am all in favour of people using distributed tools instead of centralised ones. Git itself is intended as a distributed tool and in a way it’s offensive for GitHub to have annexed its name to create a centralised and proprietary control point.

I am also keen for everyone as far as they can to exercise self-sufficiency over their computing and control of their personal data, and given Git was written as a response to the final abridgement of that self-sovereignty by the author of an earlier tool that the Linux developers were dependent on, Github is again somewhat offensive. Those would both be fine reasons to encourage people to move on from Github and to escape the social honeypot of carefully crafted network effect funnels that it embodies.

… but not because of Copilot

But Copilot is not a great reason to quit, or at least not for the reasons people insist on articulating. Those reasons seem strong on copyleft maximalism and the homeopathic thinking that assumes because there was GPL vapour in the air everything written at the time is infused. They also seem laced with a residual mistrust of Microsoft.

  • Copilot is unlikely to be infringing copyright. Certainly not in the USA. Probably not in most other places (although see Brown for more nuance). Even for humans, learning patterns doesn’t infringe copyrights, and quoting minimal or essential fragments rarely rises to the level needed for protection by copyright. Copyrights are not the same as patents, and re-expressing the same idea does not amount to infringement – even if such infringement were possible for a machine. Which it is not, so all these considerations are moot in many jurisdictions.
  • Copilot is unlikely to be breaching the GPL. That could only happen if copyright was being infringed. Just because the author of a work doesn’t like use of their code by Microsoft’s tool, that doesn’t somehow create an infringement that triggers the license.
  • Copilot in not morally bankrupt for using open source code for training. The whole point of Open Source Free Software is to give anyone the unconditional right to study the code and learn from it. If that’s a via an automated tool that makes the matter more efficient, it makes no difference.

Making a new thing that does the same as my patented widget is always an infringement of my patent, but making a new thing that does the same as my copyrighted code is not. An unfortunate consequence of the propaganda term “Intellectual Property” is that non-specialists munge all the concepts for all of {Copyright, Patents, Trade Secrets, Trademarks, Database rights} into one big hairball and assume anything matching the hairball triggers some form of infringement of any/all of the concepts. So arguments that mix-and-match IP concepts to imply an infringement are … problematic.

You shouldn’t use it for Open Source though

AI code helpers like Copilot are thus very unlikely to infringe rights per se. But that doesn’t mean code made by them should be welcome in Open Source projects.

To summarise a long article, Reda concludes that the output of an AI like Copilot is best understood as Public Domain. But ironically, that’s the real problem with Copilot for an Open Source developer. Public Domain is not Open Source, and AI-generated code introduces friction that works against the Open Source network effect for just the same reasons. As Brown explains, not every jurisdiction has the same degree of certainty or the same attributes to its conclusion about AI-generated works as seems commonly understood in the USA.

So while you may feel comfortable using AI-generated blocks in your code, what will you write in the pull-request to give others the same confidence? Even Github (and indeed Amazon) are at pains to point out that’s your responsibility, not theirs. Their tool may be a very helpful learning aid, but it’s something of a trap for the responsible Open Source contributor.

There’s a different case to be understood in every jurisdiction both about the code origin and the threshold for copyrightability. While the (many) lawyers I have heard from have largely waved a hand and said the arguments would never stand up in court, the arguable cases create a context where a community can’t rely on AI-generated code without further advice. Just like Public Domain, that added friction makes it non-viable for any community serious about provenance.

The biggest challenges are the ones exerting subtle, systemic steering effects that people don’t take seriously. Github may not be a digital scofflaw, but their tool is a Siren tempting you onto rocks that can ruin communities.

(Thanks to the Patreon backers who made this post possible)