What most people actually object to is a large corporation spending countless resources to vacuum up public data in order to create a privately controlled model.
I am curious how and why this seems to be viewed differently when it comes to something like proprietary code.
For example, there are vast amounts of publicly available code. Some of it is under restrictive licenses, and some is in the public domain. Much of it is hosted on the Internet, and much more is printed in educational books and other materials you can find in your local library. If a business, large or small, references publicly available information to create its own proprietary code, and that code does not contain any actual instances of infringing code (just as a trained AI model's files do not contain any actual images and therefore no actual infringing data), why is that considered okay? It is extremely rare for completely new, original code to be written, especially when a publicly available, well-known method already exists. Why reinvent the wheel?
What I mean is, are the people who feel the way you describe upset when they see any project, from any business, large or small, that referenced anything publicly available? Are they upset that the names of all the references are not listed in the credits of every project ever? What is their problem with this? Does it matter whether a business that does this has 1,000 employees or just one, since the outcome is more or less the same?
Additionally, nothing prevents private citizens from doing the exact same thing themselves. A person can vacuum up publicly available data to train a model that only they have access to. Would the people you describe object to that as well?