Gene has asked me to write an executive summary that conveys the essence of my new book, The Software IP Detective’s Handbook: Measurement, Comparison, and Infringement Detection. While I definitely appreciate his request, I hope I’m not completely successful because that would mean that the two years of nights and weekends I spent writing the book, not to mention the years developing the mathematical algorithms and the methodologies described in the book, could have been done in a single evening.
I’ve personally been working as an expert witness in intellectual property disputes, specializing in software cases, for about 15 years. When I began working in this area, I found that most experts used a combination of off-the-shelf computer code analysis programs, home-grown analysis programs, and lots of long hours and late nights poring over lines of code. Some experts used tools available from universities that are called “software plagiarism detection tools” that produced dubious results even when they executed correctly. Expert reports were then written and rebutted. Arguments often got very technical and detailed and could easily confuse a non-technical judge or jury. Different experts often had different definitions of plagiarism or found different signs that they considered markers for copied code. Some parties to a litigation, and some experts they hired I’m sad to say, seemed to purposely cloud the issue to justify illicit or at least questionable behavior. I decided that a standard measure of software copying that could be objectively tested was needed, and so I developed code correlation.
My book is generally about software intellectual property and specifically about the field of software forensics. Software forensics studies the software code that instructs a computer to perform operations, and it discovers information about the history and usage of that software for presentation in a court of law. My book describes various kinds of code correlation measures, including source code correlation, source code cross-correlation, object code correlation, and source/object code correlation, that are used to compare software and find copyright infringement. I also include chapters on detecting patent infringement and on detecting trade secret infringement.
Before explaining these code correlation measures, it’s important to define what they’re measuring. To a programmer, defining the various elements that comprise software source code may seem trivial and unnecessary. However, to measure software accurately, we need to have a common definition. There are many legitimate and useful definitions of source code for various purposes. For correlation, I found it important to consider that source code consists of three basic kinds of elements: statements, from which we can derive a control structure; comments, which document the code; and strings, which are generally messages to users. These elements are shown in Table 1. Statements can be further broken down into instructions and identifiers. Instructions comprise control words and operators. Identifiers comprise variables, constants, functions, and labels. A single line of source code may include one or more statements and one or more comments.
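To make the taxonomy above concrete, here is a minimal Python sketch that sorts the pieces of a line of C-like code into comments, strings, control words, identifiers, and operators. It is only an illustration under stated assumptions: the names `CONTROL_WORDS`, `TOKEN_RE`, and `classify_elements` are hypothetical, the list of control words is deliberately incomplete, and this is not how CodeMatch parses code.

```python
import re

# Hypothetical, deliberately small set of C control words (an assumption).
CONTROL_WORDS = {"if", "else", "for", "while", "do", "switch",
                 "case", "break", "return", "goto"}

# One alternation, tried left to right at each position, so a quote inside
# a comment (or a comment marker inside a string) is not misread.
TOKEN_RE = re.compile(r'''
    (?P<comment>//[^\n]*|/\*.*?\*/)      # line or block comment
  | (?P<string>"(?:\\.|[^"\\])*")        # double-quoted string literal
  | (?P<word>(?<!\w)[A-Za-z_]\w*)        # control word or identifier
  | (?P<operator>[-+*/%=<>!&|^~]+)       # run of operator characters
''', re.VERBOSE | re.DOTALL)


def classify_elements(source):
    """Group the elements of C-like source text by kind.

    Numbers and punctuation such as parentheses and semicolons are skipped.
    """
    elements = {"comments": [], "strings": [], "control_words": [],
                "identifiers": [], "operators": []}
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            elements["comments"].append(text)
        elif kind == "string":
            elements["strings"].append(text[1:-1])  # drop the quotes
        elif kind == "word":
            bucket = "control_words" if text in CONTROL_WORDS else "identifiers"
            elements[bucket].append(text)
        else:
            elements["operators"].append(text)
    return elements


line = r'if (count > 10) printf("Too many\n"); // check "limit"'
elements = classify_elements(line)
for kind, found in elements.items():
    print(kind, found)
```

For the sample line, the control word `if` and the operator `>` are instructions, `count` and `printf` are identifiers, the quoted message is a string, and everything from `//` onward is a single comment.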
One of the challenges of defining a correlation measure was to have it yield large values even when only a small portion of code is correlated or only specific elements of code, such as identifiers or comments, are correlated. This is done purposely because correlation is a detective’s tool, designed to lead the user to suspicious sections of code regardless of how small those sections are. Many cases involve small but important sections of code buried within very large files. In fact, this is one way that some deceptive programmers purposely hide their copying.
When copying has occurred, much of the code may have changed by the time it’s examined, either through the normal development process or in an effort to disguise the copying. For example, identifiers may have been renamed, code reordered, instructions replaced with similar instructions, and so forth. However, perhaps one comment remains the same and it’s an unusual comment. Or a small sequence of critical instructions is identical. Correlation is designed to produce a relatively high value based on that comment or that sequence, to direct the detective toward that similarity. If correlation were simply a percentage of copied lines, the number could be small and thus missed entirely among the noise of normal similarities that occur in all programs.
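The contrast between a percentage of matching lines and a measure that highlights one unusual match can be sketched in a few lines of Python. This is not the correlation formula from the book. The rarity weighting below, the tiny `background` corpus, and the names `percent_matching`, `rarity`, and `strongest_lead` are all assumptions chosen purely for illustration.

```python
def percent_matching(lines_a, lines_b):
    """Fraction of lines in A that also appear somewhere in B."""
    in_b = set(lines_b)
    return sum(1 for line in lines_a if line in in_b) / len(lines_a)


def rarity(line, background):
    """1.0 if no background file contains the line, 0.0 if every file does."""
    hits = sum(1 for file_lines in background if line in file_lines)
    return 1 - hits / len(background)


def strongest_lead(lines_a, lines_b, background):
    """Return (rarity, line) for the rarest line shared by A and B."""
    shared = set(lines_a) & set(lines_b)
    if not shared:
        return 0.0, None
    best = max(shared, key=lambda line: rarity(line, background))
    return rarity(best, background), best


# Hypothetical data: unrelated programs that all contain boilerplate lines.
background = [{"return 0;", "i++;"}, {"return 0;", "}"}, {"return 0;", "i++;"}]

UNUSUAL = "// kludge: retry twice because the modem drops the first byte"
file_a = [f"a_stmt_{n};" for n in range(97)] + ["return 0;", "i++;", UNUSUAL]
file_b = [f"b_stmt_{n};" for n in range(97)] + ["return 0;", "i++;", UNUSUAL]

print(percent_matching(file_a, file_b))
print(strongest_lead(file_a, file_b, background))
```

In this made-up data only 3 of 100 lines match, and two of those are boilerplate found in most programs, so the percentage is tiny. The rarity-based lead, by contrast, points straight at the one unusual comment, which is the behavior the paragraph above describes.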
It’s important to note that source code correlation does not determine the reason for the correlation. It doesn’t determine guilt or innocence. That determination is up to the detective who uses the correlation measure and examines the code, and it is ultimately up to the court in a case of intellectual property theft or infringement. Finding a correlation between different programs does not necessarily imply that illicit behavior occurred. I’ve found that there are exactly six possible reasons for correlation:
- Third-Party Source Code
- Code Generation Tools
- Commonly Used Elements
- Common Algorithms
- Common Author
- Copying (Plagiarism, Copyright Infringement)
Two programs can be correlated because they both use the same third-party code, such as open source code or commercial source code libraries. Correlation can also be caused by two programs being generated, completely or partially, by a code generation tool like Microsoft Visual Studio or Adobe Dreamweaver. Many programmers use certain elements throughout their code, including identifiers with common names such as count or index, and certain element names may be standard terms in a particular industry. Certain algorithms are taught in computer science schools throughout the world, so it is entirely possible for two unrelated programs to incorporate these same algorithms, resulting in correlation. And a programmer may develop a program at one company, then leave and independently develop a program at another company. This is perfectly legal and, if done correctly, does not constitute copyright infringement, but there may be a coding style that matches in both programs, causing correlation. When these five reasons for correlation have been eliminated, if any correlation remains, that correlation is due to copying. Only if that copying was unauthorized and is “substantial” does it constitute copyright infringement.
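The process of elimination described above can be pictured as a series of filters applied to the matching elements. The Python sketch below is only an illustration: the identifier sets, the `filters` mapping, and the function `explain` are hypothetical, and only two of the five innocent reasons are modeled. In practice each reason requires examination of the code by the detective, not a set lookup.

```python
from enum import Enum


class Reason(Enum):
    THIRD_PARTY = "third-party source code"
    CODE_GENERATION = "code generation tools"
    COMMON_ELEMENTS = "commonly used elements"
    COMMON_ALGORITHMS = "common algorithms"
    COMMON_AUTHOR = "common author"
    COPYING = "copying"


# Hypothetical reference data (assumptions for this sketch only).
THIRD_PARTY_IDENTIFIERS = {"vendorlib_init", "vendorlib_close"}
COMMON_IDENTIFIERS = {"count", "index", "i", "tmp"}

# Each innocent reason maps to a predicate that accepts matches it explains.
filters = {
    Reason.THIRD_PARTY: lambda name: name in THIRD_PARTY_IDENTIFIERS,
    Reason.COMMON_ELEMENTS: lambda name: name in COMMON_IDENTIFIERS,
}


def explain(match, filters):
    """Return the first innocent reason that explains the match, else None.

    None means "not yet explained". It does not mean copying: the remaining
    reasons still have to be ruled out by examining the code.
    """
    for reason, accepts in filters.items():
        if accepts(match):
            return reason
    return None


matching_identifiers = ["count", "vendorlib_init", "acct_xfer_fudge"]
unexplained = [m for m in matching_identifiers if explain(m, filters) is None]
print(unexplained)
```

Here `count` is explained as a commonly used element and `vendorlib_init` as third-party code, leaving `acct_xfer_fudge` as the match that deserves a closer look.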
I developed correlation for use in a software product called CodeMatch® that’s now incorporated in a product called CodeSuite® that’s sold by my software company, Software Analysis and Forensic Engineering. One of the first uses of CodeMatch was in a case I call “The Case of the Overconfident Defendant.” While the large banks have their own large, in-house development teams, and outsourced development teams, to create the software to handle transactions, there are still many local banks that do not. Instead, they purchase software from small companies; these small companies can be very profitable. In this case, a bank saleswoman had seen a need for a particular type of transaction software with certain features. She started a company, hired a couple of contract programmers to develop the program, and successfully sold the software to a number of banks. These banks began requesting a few extra features, so she discussed them with the programmers, who told her these features were very difficult if not impossible to develop. Later, however, she discovered that there was another company selling software with these exact features. After some investigation, she found that the company was run by the contract programmers. She hired a law firm and sued for copyright infringement (and probably trade secret theft, but I don’t recall).
The contract programmers insisted that they had written the code from scratch and they said they were so confident that they were willing to allow a code expert (me) to examine all of their code. They did place two conditions—the examination had to take place at their attorney’s office and I had exactly one business day to complete the analysis. They didn’t know about CodeMatch.
The programmers had done a very good job covering up their tracks. But not good enough. For example, they had forgotten about the installer, the utility program that installs the main program on a user’s machine. I discovered a funny thing about the installer. These were Windows programs. Windows has a database called the “registry” that contains information about each program installed on the computer. Each software company creates special locations in the registry for the information for its programs. Microsoft programs store program information in registry locations labeled “Microsoft,” while Symantec programs store program information in registry locations labeled “Symantec,” for example. I found correlation between the installers for the two programs, and when I looked at the correlated code I found that the information about the program from the contract programmers’ company BBBB was being installed in the registry in the location labeled for the saleswoman’s company AAAA. Not only could I show that the installer code and the program code were copied, but I could also show that the original code was developed for the saleswoman’s company and subsequently copied by the programmers for their own company. CodeMatch, and source code correlation, had proven their worth.