Content Here

Where content meets technology

Jun 22, 2023

AI and Content Licensing

In my last re-platforming of this blog, I accidentally dropped the Creative Commons Attribution license that I had been using. Blogging platforms treat licensing as part of the format rather than the content itself; the format lives in the CMS theme, so when the CMS changes, the content moves but the licensing does not. I am still making up my mind about whether that is a good thing. But at the moment, I am thinking about the broader issue of content re-use and attribution now that published content is being used as AI training data.

People publish content for a variety of reasons. Personally, I write to explore and refine ideas and also for the potential to discuss these topics with people who stumble across my posts (although that rarely happens). There is also a recognition element. My blog is where people can associate me with what I know and think. Many websites and communities are built around the value of recognition. For example, sites like Stack Overflow have a culture around recognizing and rewarding expertise.

I have been using the Creative Commons Attribution license because I want people to use and further my ideas and I also want to be part of the ongoing discussion and evolution of those ideas. Based on the language of the license, I thought it would protect these interests:

"You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use"

But, according to the article "Should CC-Licensed Content be Used to Train AI? It Depends" (by Brigitte Vézina and Sarah Hinchliff Pearson), there is no legal consensus that copyright, and therefore the license's attribution requirement, even applies to using content as AI training data.

Large Language Models, trained on terabytes of content (the GPT-4 training set is rumored to be around a petabyte, though OpenAI has not disclosed its size), create new value for content consumers who want condensed answers. But that intermediation saps value from the content producers and publishers. The ChatGPT user has no idea whether some pearl of wisdom came from me (doubtful), and I have no idea whether my knowledge was accessed or what became of it.

I think that I will continue to write even though I know my words will be anonymized by AI. I still get the value of using writing to organize my thoughts and to develop my communication skills. Jack Ivers has a great post describing the reasons for writing every day. But I don't think I would be as excited to post answers on Stack Overflow unless I wanted to build adoption for a particular technology that I supported. I am even less likely to post an answer on Quora.

I wonder if AI chatbots will stifle other contributors' motivation. Perhaps they already have, but I haven't heard much of an uproar. If generative AI drives the extinction of user-generated content (the very content that improves AI), the progress of knowledge will slow because there will be no new experiences to incorporate.

Wikipedia is a bit different. Wikipedia contributors are mainly concerned about the accuracy of the content rather than attribution. In many ways, personal attribution taints the authority of the article with the possibility of bias. Consequently, you have to dig to find who wrote what. Wikipedia is already harvested by search engines and voice assistants (both Alexa and Google Assistant rely heavily on it). The contributors don't seem to mind.

For now, I have re-added a Creative Commons license to the footer of this blog and to the syndication feed (this blog runs on Pelican; I might submit a pull request for the feed part). Not that it does any good.
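
If you are curious what "re-adding" looks like mechanically, here is a rough sketch for a Pelican site. The setting names below are ones I made up for illustration; Pelican makes uppercase settings defined in pelicanconf.py available to the theme's Jinja2 templates, so the footer template can pick them up.

```python
# pelicanconf.py -- CC_LICENSE_NAME and CC_LICENSE_URL are invented names,
# not built-in Pelican settings; any uppercase setting defined here is
# exposed to the theme's Jinja2 templates.
CC_LICENSE_NAME = "Creative Commons Attribution 4.0 International"
CC_LICENSE_URL = "https://creativecommons.org/licenses/by/4.0/"

# Then, in the theme's base.html footer, something along the lines of:
#   {% if CC_LICENSE_URL %}
#     <a rel="license" href="{{ CC_LICENSE_URL }}">{{ CC_LICENSE_NAME }}</a>
#   {% endif %}
```

The syndication feed is the harder part: Pelican assembles its Atom feed internally rather than through a theme template, which is presumably where a pull request would come in.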