In Part I of this series, we discussed the Federal Trade Commission’s (FTC’s) case against Everalbum as just one example where companies may be required to remove data from their machine learning models (or shut down if unable to do so). Following are some additional pitfalls to note.
A. Evolving privacy and data usage restrictions
Legislators at the international, federal, state, and even municipal levels are continually enacting laws to restrict how companies may process data. The consequence is that data which companies had once lawfully obtained may become illegal to process for desired purposes. For example:
- Abrogation of the European Privacy Shield and Standard Contractual Clause concerns. In 2020, the European Court of Justice invalidated the “Privacy Shield” and called into question the ability of companies to blindly rely on the Standard Contractual Clauses that had facilitated transfers of data from Europe to the United States. The current European Data Protection Board guidance and newly minted Standard Contractual Clauses require businesses to conduct complex legal analysis and enter into lengthy contracts to try to legitimize these data transfers, though many questions remain as to which transfers may ultimately be permitted.
- Health Insurance Portability and Accountability Act (HIPAA). Although the most recent changes to and clarifications of HIPAA generally give covered entities more room to share protected health information (PHI), authorization must still be obtained for all uses not covered by the HIPAA Privacy Rule. The issue many companies run into is that PHI is obtained with a particular consent or authorization, and later, the company wants to use, or does use, the PHI in a way for which authorization was not obtained.
- General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA). The GDPR and CCPA give consumers the right to have their data deleted. Thus, whether companies may continue to use machine learning models that “learned” from traces of those consumers’ data is an open question.
- Municipal-level regulations. Several municipalities, such as San Francisco, have been on the verge of passing local-level privacy regulations. Such laws would demand even greater flexibility in targeted removal of data.
As these examples show, companies that had properly obtained data may find themselves unable to legally process that data after laws are changed. Neural network systems trained on such data may be vulnerable if the laws forbid ongoing processing of that data.
B. Trade secret misappropriation through employee mobility
Employees changing jobs and bringing data to their next employer are a constant source of litigation. These cases may stem from malicious misappropriation, or may result when a departing employee fails to delete cloud-based data repositories that contain confidential data. In such cases, if the prior employer’s data can be traced to the next employer, there may be a credible allegation that the offending data was ingested into the new employer’s ML model.
For example, in WeRide Corp. v. Huang, an executive and a technical employee of one self-driving car company joined a rival. The technical employee downloaded large amounts of data before changing companies, and allegedly brought the data to his new employer, whose capabilities for detecting pedestrians suddenly improved. On a motion for preliminary injunction, the Northern District of California enjoined defendants from using the former employer’s confidential information, as is commonly ordered in trade secret cases. In the context of neural network ML models, however, it is unclear how to comply with such an order, because it may be impossible to stop using traces of the stolen data. (Because WeRide was resolved through terminating sanctions, resolving that issue in court will await another case.)
C. Violations of copyright and terms of service from public-facing sources
Companies that ingest data from the Internet run the risk of violating questionable permissions. While some data is freely available, other sources are tightly controlled, with a broad continuum of legal restrictions between these poles. Public-facing data might be ingested into ML models by many techniques, including through scraping and through application programming interface (“API”) access. Scraping is not necessarily illegal, and in one example, the Ninth Circuit found that HiQ had not violated the Computer Fraud and Abuse Act when it scraped professional biographies from LinkedIn. Nonetheless, the Ninth Circuit reserved ruling on whether other laws had been violated.
Copyright remains untested in the ML context, although GitHub’s Copilot may soon be a testing ground. Depending on the approach taken, it is likely that a copy of data is generated and reproduced, for example through the scraping process and by ingesting the data into an ML model. Whether fair use may provide a defense against copyright liability in this context is untested, nor is it determined whether the outputs of ML engines are derivative works of the original copyrighted matter. Thus, companies take a gamble in trying to guess what the copyright fair use defense will protect when ingesting data from the Internet.
The above scenarios are only a handful of the situations in which companies might ingest legally tainted data into their ML systems. If a company is caught, attempting to stop the use of misappropriated data raises difficult questions. Because neural network systems typically have no “rewind button” to remove traces of the misappropriated data, courts could potentially order the ML model to be destroyed.
Companies should plan ahead to avoid being saddled by court orders that may be impossible to obey. Understanding that it may be infeasible to erase the imprint of data from neural network systems, following are some guidelines that could prevent the problem, or at least cushion the consequences of having been caught ingesting tainted data.
- Consider alternate architectures, such as k-nearest neighbors (k-NN). Approaches such as k-NN are editable, meaning that specific data can be removed from the ML model. While not feasible for all applications, consider whether an editable system might work as an architecture.
- Licensing your datasets. Datasets that are used to train your model should be properly licensed. Carefully record your licensing practices, and enforce them across your workforce, to ensure that the entire team is in compliance.
- Preserving data lineage. Companies developing ML models should preserve records documenting the sources of data ingested. If later accused of having stolen data, there is nothing as effective as being able to point to records confirming the legitimate source of the data.
- Version control of training datasets. Training datasets evolve, with additional data often added to improve outcomes (reduced error rate), expand scope, and cover edge cases. In software engineering, version control is widely used to manage the evolution of software source code, such that any change to the software is committed to a repository, which contains every version of the software over time. Consideration should be given to doing the same for datasets. Although it would require considerably more storage to keep training data than a trained neural network model, having access to the various versions of the data used to train the model, as well as the model itself, would enable the company to revert to an earlier state in the event of an adverse court order.
For example, a company may acquire a start-up, without fully knowing the source of that start-up’s data, and may want to use that acquired data to improve its ML models. Or a company might hire an employee, where there is a risk of a trade secret suit from the new hire’s former employer. In these circumstances, a good preventative measure would be managing the versions of the ML model and its training data prior to onboarding the new data. If an issue occurred with the new data, the ML model could in theory be regenerated from the versioned data.
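As a rough illustration of the versioning idea, the following Python sketch keeps immutable snapshots of a training dataset and finds the most recent snapshot that contains nothing from a disputed source. It is a minimal sketch under stated assumptions, not a description of any particular tool: the names `DatasetHistory`, `commit`, and `latest_without`, the per-record `source` field, and the source labels are all hypothetical, and a real system would store snapshots on durable storage rather than in memory.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    version: int
    records: List[Dict[str, str]]  # assumption: every record carries a "source" field
    digest: str                    # SHA-256 fingerprint of the snapshot contents


def fingerprint(records: List[Dict[str, str]]) -> str:
    """Hash the records so a snapshot can later be shown to be unaltered."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class DatasetHistory:
    """Hypothetical in-memory version history for a training dataset."""

    def __init__(self) -> None:
        self.versions: List[DatasetVersion] = []

    def commit(self, records: List[Dict[str, str]]) -> DatasetVersion:
        # Copy the records so later edits cannot change a committed snapshot.
        snapshot = [dict(r) for r in records]
        version = DatasetVersion(len(self.versions) + 1, snapshot, fingerprint(snapshot))
        self.versions.append(version)
        return version

    def latest_without(self, source: str) -> Optional[DatasetVersion]:
        """Return the newest snapshot containing no records from `source`."""
        for version in reversed(self.versions):
            if all(r["source"] != source for r in version.records):
                return version
        return None


# Usage: commit a snapshot before onboarding acquired data, then again after.
history = DatasetHistory()
base = [{"id": "1", "source": "licensed_vendor"}]
history.commit(base)
history.commit(base + [{"id": "2", "source": "acquired_startup"}])

# If the acquired data is later challenged, find the snapshot to retrain from.
clean = history.latest_without("acquired_startup")
print(clean.version if clean else "no clean snapshot")
```

The design choice here is simply that each snapshot is committed before new data is onboarded, so that a model could in theory be retrained from the last snapshot that predates the disputed data.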
- Tracking ephemeral data. Frequently, data that train ML models cannot be archived. For example, self-driving cars stream a constant torrent of data to central servers. This data may be simply ephemeral, and lost after feeding through the ML model. In that situation, maintaining good records of the architecture of the data intake procedures, along with the licensing scheme for data to ensure proper use, may be the best approach.
- Due diligence demands. Companies acquiring targets should seek records of what data was ingested to develop the targets’ ML models. As part of due diligence, acquirers should confirm that the ingested data was lawfully obtained.
In summary, companies should be prepared to prove that their ML models have been trained with authorized data, and to structure an “out” if their datasets are tainted. Architectural solutions, such as k-NN systems, may provide proper editability. And proactively, through version control, documentation, and due diligence, companies can minimize their exposure to a court order that may otherwise be impossible to obey without the catastrophic loss of entire ML models.
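To make the editability point from the first guideline concrete, the following Python sketch shows a toy k-NN classifier whose “model” is nothing more than its stored examples, so that removing data from a given source changes predictions immediately without retraining. This is a simplified illustration under stated assumptions: the class `EditableKNN`, the `source` tag on each example, and the sample labels are hypothetical, and production k-NN systems use far more efficient search structures.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Example:
    features: Tuple[float, ...]
    label: str
    source: str  # hypothetical provenance tag, e.g. a vendor or license ID


class EditableKNN:
    """Toy k-NN classifier: the stored examples are the entire model."""

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.examples: List[Example] = []

    def add(self, example: Example) -> None:
        self.examples.append(example)

    def remove_source(self, source: str) -> int:
        """Delete every example from one source; return how many were removed."""
        before = len(self.examples)
        self.examples = [e for e in self.examples if e.source != source]
        return before - len(self.examples)

    def predict(self, features: Tuple[float, ...]) -> str:
        if not self.examples:
            raise ValueError("no training examples available")
        # Rank stored examples by Euclidean distance and let the k nearest vote.
        nearest = sorted(
            self.examples, key=lambda e: math.dist(e.features, features)
        )[: self.k]
        votes = Counter(e.label for e in nearest)
        return votes.most_common(1)[0][0]


# Usage: a prediction that depends on disputed data changes once it is removed.
model = EditableKNN(k=1)
model.add(Example((0.0, 0.0), "car", "licensed_vendor"))
model.add(Example((1.0, 1.0), "pedestrian", "disputed_source"))
print(model.predict((0.9, 0.9)))
print(model.remove_source("disputed_source"))
print(model.predict((0.9, 0.9)))
```

Because nothing is learned beyond the stored examples, deleting them leaves no residual trace in the model, which is the property that neural network systems typically lack.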