Machine learning: the new frontier in zero-day defence

How to train your computer to sniff out malware

The cybersecurity industry is a battlefield. Black hats constantly try to find new vulnerabilities in computer systems, and vendors rush to identify and fix them. The user is caught in the crossfire.

At the front line of this battle is the zero-day attack. Zero-day vulnerabilities are security holes in systems that malicious actors have found but which have not yet been patched by vendors. Cyber criminals can exploit these vulnerabilities to create attacks that users can’t prevent. Until the vendor issues a patch, the user is unprotected.

What’s even worse is that the anti-malware tools those users bought to protect themselves probably won’t stop the attacks either.

Antivirus software evolved in the late 1980s in response to a nascent hobbyist industry creating viruses and worms. Antivirus tools scanned files on local machines and compared them to a growing list of signatures – the digital footprint made by the file when it is written to disk, represented as hashes.

This worked up to a point but these days signature-based scanners are fighting a losing battle. Attackers developed packers, obfuscators and polymorphic algorithms to change the footprint of the files, making a successful match more difficult.
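To see why signature matching is so brittle, here is a minimal sketch in Python. It assumes SHA-256 as the hash (the choice of algorithm is ours, for illustration only), and the sample bytes and function names are hypothetical stand-ins rather than anything taken from a real scanner.

```python
import hashlib


def signature(data: bytes) -> str:
    """Return the SHA-256 hex digest, used here as the file's signature."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical bytes standing in for a known-malicious file.
known_sample = b"MZ\x90\x00 pretend malicious payload"

# The scanner's "list of signatures": hashes of files already seen.
signature_db = {signature(known_sample)}


def is_known_malware(data: bytes) -> bool:
    """Flag a file only if its hash exactly matches a stored signature."""
    return signature(data) in signature_db


# A trivially altered variant: the same payload with one padding byte added.
variant = known_sample + b"\x00"

print(is_known_malware(known_sample))  # exact match against the database
print(is_known_malware(variant))       # hash differs, so the scanner misses it
```

A single extra byte produces a completely different hash, so the variant slips past. Packers and polymorphic engines automate this kind of change at scale.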

Compare and contrast

The number of malware strains that the scanners must index and compare has also grown exponentially, especially since malware writers began using their programs to generate revenues. AV-Test, an independent antivirus testing laboratory, discovered almost 120 million new malware strains in 2016 alone, putting the total number of malware strains in the wild at 600 million.

Zero-day attacks present a further challenge: because they have not been seen in the wild before, their file signatures are simply not available. The attacks are also growing in number. Symantec documented 125 per cent more zero-day vulnerabilities in 2015 than in the previous year.

Entire marketplaces have even developed to buy and sell information about these vulnerabilities so that malicious actors can get ahead of their targets.

Zero-day exploits often focus on specific high-value campaigns. Examples include the Stuxnet worm which targeted the Iranian Natanz uranium enrichment plant in 2010, and Aurora, which attacked companies such as Google and Adobe via older versions of Internet Explorer in 2009.

Some zero-day attacks target groups of individuals rather than institutions. In 2014 attackers compromised the US Veterans Of Foreign Wars website. The attackers added an iFrame into its HTML code to deliver an Adobe Flash object from a malicious website. The Flash object exploited an unknown bug in Microsoft IE 10 that installed a backdoor program.

Beware the Fancy Bear

The incident, labelled Operation SnowMan, was a watering hole attack designed to infect those visiting the site. Launched shortly before Presidents’ Day, it was thought to be a targeted attack on military service members, possibly to steal military intelligence.

More recently, Russian hacking group APT28, also known as Fancy Bear, has been associated with the hack on the Democratic National Committee, using zero-day vulnerabilities to mount its attacks. It also targeted almost 1,900 individuals in 2015 using multiple zero-day vulnerabilities in Windows, Adobe Flash and Java. Targets included ministries of defence, political leaders and high-ranking NATO officials.

Long ago, antivirus companies began using heuristics as another method to help them spot malware, including zero-day attacks. Heuristic analysis uses techniques such as examining the contents of a file to try to work out what it might do, or running the file in a sandbox environment to watch its behaviour. If it acts like a virus, a heuristic engine might decide that it is one.

Heuristic analysis is problematic, though. It is still a form of signature scanning, because the heuristic engine must have a database of suspicious behaviours to compare a file against. File analysis can also be demanding on system resources, and it has been known to create false positives. In some cases, antivirus tools have crippled legitimate Windows system files after scanning algorithms failed.
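The heuristic approach can be sketched in the same spirit. The Python below is a rough illustration only: the behaviour names, weights and threshold are all invented for the example and do not come from any real product.

```python
# Hypothetical database of suspicious behaviours and their weights.
SUSPICIOUS_BEHAVIOURS = {
    "writes_to_system_directory": 3,
    "modifies_startup_entries": 3,
    "disables_security_tools": 5,
    "opens_network_listener": 2,
}

# Hypothetical score at or above which a file is flagged.
ALERT_THRESHOLD = 5


def heuristic_score(observed: set) -> int:
    """Sum the weights of every observed behaviour found in the database."""
    return sum(
        weight
        for behaviour, weight in SUSPICIOUS_BEHAVIOURS.items()
        if behaviour in observed
    )


def looks_malicious(observed: set) -> bool:
    """Flag the file when its score reaches the alert threshold."""
    return heuristic_score(observed) >= ALERT_THRESHOLD


# Behaviours a sandbox might record for three hypothetical files.
backdoor = {"disables_security_tools", "opens_network_listener"}
installer = {"writes_to_system_directory", "modifies_startup_entries"}
novel_attack = {"encrypts_user_documents"}  # behaviour not in the database

for name, observed in [
    ("backdoor", backdoor),
    ("installer", installer),
    ("novel_attack", novel_attack),
]:
    print(name, heuristic_score(observed), looks_malicious(observed))
```

The legitimate installer crosses the threshold and is flagged, which is the false-positive problem in miniature. The novel attack scores zero because its behaviour is not on the list, which is why this remains a form of signature scanning.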

All of this has led some to pronounce antivirus tools effectively dead. A Google security expert recently advised people to stop investing in antivirus altogether. That might be a little drastic, but the trend is clear: it’s time for a new approach.

People know best

If it’s so difficult for automated anti-malware tools to check every specific malicious file or behaviour, perhaps they shouldn’t try. Instead, they should learn to think like people.

Cybersecurity experts tend to use their own street sense. They have been around long enough to learn what behaviours might put them at risk, and machine learning effectively enables computers to absorb what human experts know.

Whereas conventional antivirus algorithms compare files and behaviours against a long, known list, this branch of artificial intelligence explores millions of historical files that have already been categorised as malicious or benign. It extracts a vast panoply of technical characteristics from each of these files, then uses them to learn what to look for in a malicious file and what constitutes a benign one.
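As a toy illustration of that training process, the Python sketch below extracts two simple features from each labelled file and learns an average "profile" for each class. Everything here is an assumption made for the example: the two features (byte entropy and the share of printable bytes), the nearest-centroid classifier and the synthetic training files. Real products draw on far more characteristics and far more data.

```python
import math
import random
from collections import Counter


def extract_features(data: bytes) -> tuple:
    """Two toy features: normalised byte entropy and printable-byte fraction."""
    counts = Counter(data)
    n = len(data)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    printable = sum(1 for b in data if 32 <= b < 127) / n
    return (entropy / 8.0, printable)


def train(labelled):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, Counter()
    for data, label in labelled:
        feats = extract_features(data)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] += 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }


def classify(model, data: bytes) -> str:
    """Return the label whose centroid is closest to the file's features."""
    feats = extract_features(data)
    return min(model, key=lambda label: math.dist(feats, model[label]))


# Synthetic stand-ins for a labelled corpus (assumption: packed malware
# looks like random bytes, benign files look like plain text).
rng = random.Random(0)
WORDS = [b"report", b"invoice", b"meeting", b"the", b"and", b"notes"]


def fake_packed(n: int = 512) -> bytes:
    return bytes(rng.randrange(256) for _ in range(n))


def fake_text(n: int = 512) -> bytes:
    return b" ".join(rng.choice(WORDS) for _ in range(n))[:n]


training = [(fake_packed(), "malicious") for _ in range(20)]
training += [(fake_text(), "benign") for _ in range(20)]
model = train(training)

# Files the model has never seen: no hash lookup is involved.
print(classify(model, fake_packed()))
print(classify(model, b"quarterly budget summary for the team " * 10))
```

The point of the sketch is the shape of the process rather than the specific features: the model stores a compact statistical summary instead of a list of hashes, so it can judge a file it has never encountered.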

Companies can set the machines to work, constantly updating and revising their understanding of what constitutes malicious behaviour. When they scan new files, they compare them against a statistical model that is far more manageable than an unwieldy and resource-hungry list of hashes.

Not only can this lead to better performance, but it also improves the likelihood of detecting zero-day threats. Combinations of behaviours that may not generate a hit in a conventional heuristic scanning system may trigger an alert in a machine learning solution.

Given the industry’s history of compromises and data breaches, it’s clear that the old models aren’t working. Machine learning is already helping us to automate other tasks that traditional algorithms wouldn’t be very good at, such as recognising speech or even driving cars.

It’s time to let these algorithms loose in a world of rapidly evolving malware.