SourceFinder software stalks malware in the wild

The best defense against malware is a good offense: inoculating computers against malicious programs.

Unfortunately, security researchers often don’t know about a piece of malware until it has already done a lot of damage. Trying to reverse engineer malware can be extremely difficult without any insight into the original program.

Computer scientists at UC Riverside have developed a tool called SourceFinder that locates online malware source code repositories with 89% accuracy. Access to the source code will help security specialists and anti-virus software developers understand threats and design defenses.

How researchers identified 7.5K malware repositories: Rokon et al. (2020) identified 7.5K malware source code repositories on GitHub, starting from 32M repositories and using 137 malware keywords.

"The study emerged serendipitously as we were studying the rapid growth of Internet of Things malware and identifying the publicly available malware source codes they are based on,” said Ahmad Darki, a recent UC Riverside Ph.D. recipient who came up with the idea for the project. This was especially surprising as no one could have imagined that malware developers would share their code in public to gain credit in their community and help others around the world to engage in malicious activities.”

Source code is the part of a computer program humans can read. Developers run the source code they write through programs that translate it into secondary code unintelligible to humans. Some software companies and developers only release the finished, or compiled, code so no one can change it. Others release the source code, so users can find problems, make improvements, or customize the program. Trying to figure out what makes a program tick without access to the source code can be very difficult.

"It has been very rewarding to have research groups contact us within days of the conference presentation asking for our dataset. It just shows the need for such a resource. We are thrilled our work is helping others do better research," said doctoral student Md Omar Faruk Rokon, who led the research.

Like developers of legitimate software, hackers often share their malicious creations in public archives such as GitHub. In a paper presented at the 23rd International Symposium on Research in Attacks, Intrusions and Defenses, or RAID 2020, the researchers described a supervised learning approach they used to scan 97,000 malware-related software repositories and locate more than 7,500 malware source code repositories, producing possibly the largest malware source code database in the world.

"In most of the cases, malware authors seem to be notorious hackers who are persistent across different online forums marketing their attack tools,” said second author Risul Islam, a UCR doctoral student.

First, the group used malware-related keywords to find and download 1,000 repositories on GitHub. They investigated each repository thoroughly, labeled the ones that all agreed were malicious, and divided them into subsets. Next, they identified components of the repositories in one subset, such as words and features, and used them to train a supervised machine-learning algorithm that identified malware repositories with high accuracy.
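The two steps described above, a keyword search followed by supervised classification on word features, can be illustrated with a small sketch. This is not the SourceFinder implementation: the keyword list, the toy repository descriptions, and the choice of a naive Bayes classifier are all assumptions made for illustration, since the article does not say which features or algorithm the researchers settled on.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list for illustration; the study used 137 malware keywords.
KEYWORDS = {"keylogger", "ransomware", "botnet", "malware", "trojan"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_keywords(text):
    """Step 1: keep a repository only if its description mentions a keyword."""
    return any(token in KEYWORDS for token in tokenize(text))


class NaiveBayesClassifier:
    """Step 2: a bag-of-words naive Bayes classifier (an assumed stand-in)."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_docs}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        # Words never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_docs.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.class_docs[label] / total_docs)
            for token in tokens:
                # Add-one smoothing avoids a zero probability for unseen pairs.
                score += math.log(
                    (counts[token] + 1) / (total_words + len(self.vocab))
                )
            scores[label] = score
        return max(scores, key=scores.get)


# Toy, hand-labeled repository descriptions (invented for this example).
TRAINING = [
    ("keylogger that records keystrokes and hides from antivirus", "malware"),
    ("ransomware sample encrypts files and demands bitcoin payment", "malware"),
    ("botnet source with ddos attack modules and c2 server", "malware"),
    ("weather app showing forecast with a simple web dashboard", "benign"),
    ("library for parsing json files with unit tests", "benign"),
    ("todo list web app with user login and tests", "benign"),
]

model = NaiveBayesClassifier().fit(
    [text for text, _ in TRAINING], [label for _, label in TRAINING]
)
```

Calling `model.predict("simple keylogger that hides and records keystrokes")` labels the description as malware on this toy data. A real system would need far more labeled repositories and richer features than six one-line descriptions.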

“It was fascinating to observe generalizable patterns in this vast source code database, which overwhelmingly point to malware. Even more exciting is that those patterns are interpretable and can offer direct insights to security analysts,” said co-author Vagelis Papalexakis, an assistant professor of computer science and engineering. “It is important to underscore here that formulating the problem and distilling those patterns is far from trivial, requiring meticulous design and experimentation by Rokon that combines domain knowledge and machine learning expertise.”

The researchers also identified trends among malware repositories:

Since 2010, the number of new malware repositories per year has more than tripled every four years.
Repositories for ransomware, the malware responsible for blackmailing users and stealing personal information, emerged in 2014 and took off in 2017.
Most malware targets Windows and Linux operating systems, but there has recently been a notable increase in the number of repositories for macOS and Internet of Things, or IoT, devices.
Keyloggers, which monitor keystrokes and can steal passwords, are the most common type of malware among the repositories the group found.
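To put the first trend in perspective, tripling every four years corresponds to growth of roughly 32% per year. The short calculation below works that out; the function name is hypothetical, and the figure is a lower bound because the article says "more than tripled."

```python
def annual_growth_factor(multiple, years):
    """Return the constant yearly factor that yields `multiple` after `years`."""
    return multiple ** (1.0 / years)


# Tripling over four years, as reported for new malware repositories per year.
factor = annual_growth_factor(3, 4)
percent_per_year = (factor - 1) * 100

# Two consecutive four-year periods compound to a ninefold increase.
eight_year_multiple = factor ** 8

print(f"{percent_per_year:.1f}% per year, {eight_year_multiple:.1f}x over eight years")
```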

"The implications are huge: Having such a large database of malware can really help security

researchers develop better defenses," said co-author Michalis Faloutsos, a professor of computer science and engineering. "It is an ugly war, and hackers have the first-mover advantage. We need all the information we can get to be prepared ahead of time.”

The paper, "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories," is available here.

Header photo: Santeri Viinamäki on Wikimedia Commons