Using Artificial Intelligence/Machine Learning to Detect Domain Generation Algorithms

 

Malicious actors continually find new ways to avoid detection in today’s world of ever-evolving cyber-threats. The more dynamic their strategy, the more successful they are at getting past defenses that rely on static methods, such as rarely updated blacklists. In this article, we cover one such dynamic tactic used by cybercriminals, Domain Generation Algorithms (DGAs), along with methods that leverage artificial intelligence/machine learning (AI/ML) to counter these threats, and we discuss the false detections those methods produce.

 

What This Blog Covers

  • Domain Generation Algorithm – What is It?
  • Detecting DGAs with Artificial Intelligence / Machine Learning
  • Dealing with False Detections (Positive and Negative)
  • Conclusion

 


Domain Generation Algorithm – What is It?

A DGA is a technique that fuels malware attacks. A DGA alone can’t do you any harm, but it is a proven technique that allows modern malware to evade countermeasures and security products. Attackers use DGAs so that they can switch the command-and-control servers (also called C2 or C&C) that their malware contacts. DGAs have matured into one of the top phone-home mechanisms that malware authors use to reach C2 servers, which poses a significant threat to cloud security.

 

A DGA is defined as the use of an algorithm in malware to regularly generate a large number of domain names that function as rendezvous points for the malware’s command-and-control servers. At its core, a DGA generates domains by concatenating a pseudo-random string and a TLD (e.g., .com, .org, .cc). The malware queries each domain to see whether it gets a valid IP address in response.
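To make the "pseudo-random string plus TLD" idea concrete, here is a minimal toy sketch. It is not the algorithm of any real malware family: the function name `toy_dga`, the seed string, the date-based seeding, and the use of SHA-256 are all assumptions chosen for illustration. It only shows how two parties who share a seed and a date can derive the same list of candidate domains.

```python
import hashlib
from datetime import date

# Hypothetical TLD pool, taken from the examples in the text.
TLDS = [".com", ".org", ".cc"]


def toy_dga(seed: str, day: date, count: int = 5, length: int = 12) -> list[str]:
    """Derive `count` pseudo-random domain names from a shared seed and a date."""
    if not 1 <= length <= 32:
        # A SHA-256 hex digest has 64 characters and we consume two per letter.
        raise ValueError("length must be between 1 and 32")
    domains = []
    for i in range(count):
        # Hash seed + date + counter so every run with the same inputs
        # produces the same list of names.
        digest = hashlib.sha256(f"{seed}|{day.isoformat()}|{i}".encode()).hexdigest()
        # Map each pair of hex digits to a lowercase letter.
        label = "".join(
            chr(ord("a") + int(digest[2 * j : 2 * j + 2], 16) % 26)
            for j in range(length)
        )
        # Concatenate the pseudo-random label with a TLD.
        domains.append(label + TLDS[i % len(TLDS)])
    return domains


if __name__ == "__main__":
    for name in toy_dga("example-seed", date(2024, 1, 1)):
        print(name)
```

Because the output depends only on the seed and the date, a defender who recovers both can compute the same list, which is the basis of the reverse-engineering approach to detection.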

 

Attackers can use this simple mechanism to avoid hard-coding C2 IPs or domains into the malware code, since hard-coded values become useless once traditional filtering mechanisms have blocked them.

Attackers now only need to register a single domain from the thousands a DGA produces. This single domain then becomes the rendezvous point for the malware, botnet, or backdoor that embeds the DGA.

Detecting DGAs with Artificial Intelligence / Machine Learning

Here are two ways of identifying DGAs:

  • Reverse Engineering: If you already know the algorithm, or can reasonably guess it from limited data samples, you can calculate the domains it will generate next.
  • AI/ML: Start by observing a large number of DGA and non-DGA domains, extract statistical features from them, and then decide whether a particular domain is a DGA. You do not need the malware’s source code for the AI / ML method, because you use known DGAs to predict unknown ones. Plenty of DGA and non-DGA domains are readily available for training the model. And while the confidence level may not be 100 percent, the result is still useful.
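The article does not name a specific model, so the following is only a sketch of the "learn from known examples" idea, using a swapped-in technique: a character-bigram frequency model with add-one smoothing, trained once on benign labels and once on DGA labels. The class `BigramModel`, the function `dga_score`, the vocabulary-size constant, and the tiny sample lists are all hypothetical; a real system would train on many thousands of domains.

```python
import math
from collections import Counter

# Tiny hypothetical training samples; real training sets are far larger.
BENIGN_SAMPLES = ["google", "facebook", "wikipedia", "amazon", "youtube", "microsoft"]
DGA_SAMPLES = ["xkqjzvwpty", "qzxwvkjhgf", "zqxjkvbnmw", "pqwzxkvjyt"]


def bigrams(label: str) -> list[str]:
    """Split a label into character pairs, with ^ and $ marking its boundaries."""
    padded = f"^{label.lower()}$"
    return [padded[i : i + 2] for i in range(len(padded) - 1)]


class BigramModel:
    """Character-bigram frequency model with add-one (Laplace) smoothing."""

    # Assumed alphabet: a-z, 0-9, '-', plus one boundary symbol.
    VOCAB = 38 ** 2

    def __init__(self, labels: list[str]) -> None:
        self.counts = Counter(bg for label in labels for bg in bigrams(label))
        self.total = sum(self.counts.values())

    def avg_log_prob(self, label: str) -> float:
        """Average smoothed log-probability of the label's bigrams."""
        grams = bigrams(label)
        return sum(
            math.log((self.counts[bg] + 1) / (self.total + self.VOCAB))
            for bg in grams
        ) / len(grams)


def dga_score(label: str, benign: BigramModel, dga: BigramModel) -> float:
    """Positive when the label looks more like the DGA samples than the benign ones."""
    return dga.avg_log_prob(label) - benign.avg_log_prob(label)


if __name__ == "__main__":
    benign_model = BigramModel(BENIGN_SAMPLES)
    dga_model = BigramModel(DGA_SAMPLES)
    for candidate in ["facebook", "qzxkvjw"]:
        print(candidate, dga_score(candidate, benign_model, dga_model))
```

Retraining here is just rebuilding `BigramModel` from an updated sample list, which illustrates why no malware source code is needed for this style of detection.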

 

Comparing DGA and non-DGA domains, some differences stand out straight away:

  • DGA domains are often longer (the “length” feature in AI / ML terms), although some DGA families produce domains as short as five characters.

 

  • DGA domains contain random sequences of characters, while non-DGAs appear to follow a lexical pattern.
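The two visible differences above, length and randomness of characters, can be turned into numeric features. The sketch below is one possible way to do so and is not taken from the article: Shannon entropy and vowel ratio are common stand-ins for "randomness", and the function names are hypothetical. As a simplification, only the left-most label of the domain is analyzed.

```python
import math
from collections import Counter


def label_of(domain: str) -> str:
    """Return the left-most label, e.g. 'example' for 'example.com' (a simplification)."""
    return domain.lower().split(".")[0]


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; higher means less predictable."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def lexical_features(domain: str) -> dict[str, float]:
    """Compute simple lexical features for one domain name."""
    label = label_of(domain)
    vowels = sum(ch in "aeiou" for ch in label)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "vowel_ratio": vowels / len(label) if label else 0.0,
    }


if __name__ == "__main__":
    for name in ["google.com", "xkqjzvwptyhb.cc"]:
        print(name, lexical_features(name))
```

Human-chosen names tend to repeat letters and contain vowels in natural positions, so they usually show lower entropy and a higher vowel ratio than randomly generated labels of similar length.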

 

Other differences that aren’t as obvious are:

 

  • Most DGA domains are not resolvable: that is, they do not map to IP addresses, because only a few of them are ever registered and actually used.

 

  • Even registered DGA domains have a short lifespan (often just a few days).
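Resolvability is the simpler of these two signals to check. The following sketch assumes a plain lookup through Python's standard `socket.gethostbyname`; the function names `resolves` and `unresolved_fraction` are hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without network access. Domain lifespan is not shown because it requires observations collected over time.

```python
import socket
from typing import Callable, Iterable


def resolves(domain: str, resolver: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True if the resolver maps the domain to an IP address."""
    try:
        resolver(domain)
        return True
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError; UnicodeError can be raised
        # for labels that cannot be encoded as a valid hostname.
        return False


def unresolved_fraction(
    domains: Iterable[str], resolver: Callable[[str], str] = socket.gethostbyname
) -> float:
    """Fraction of the given domains that do not resolve."""
    names = list(domains)
    if not names:
        return 0.0
    return sum(not resolves(name, resolver) for name in names) / len(names)
```

A host that queries many names with a high unresolved fraction in a short window is showing the behavior described above, although a single failed lookup on its own says very little.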

 

All of these factors are taken into account when analyzing DGAs in the AI / ML models. 

  1. Some of the analytical methods are easy (e.g., calculating the domain length).
  2. Some are harder (e.g., analyzing the character pattern or performing lexical analysis).
  3. Some data are straightforward to obtain (e.g., whether the domains are resolvable).
  4. Some data require observation over time (e.g., domain lifespan).

 

Each of these statistical characteristics is used to calculate “points,” and the points are aggregated. If the total exceeds a certain threshold (which is adjustable, or “tuned”), the domain is flagged as a DGA. Some of the features are statistically more significant than others. The fact that most DGA domains are unresolvable is one of the most important factors because it gives us a prior likelihood before we observe the other features. Such a prior likelihood, if used correctly, could significantly improve detection confidence.
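The points-and-threshold idea can be sketched in a few lines. Every number below is a made-up assumption for illustration (the cut-offs, the weights, and the default threshold); in practice they would be tuned on labeled data. The unresolvable signal is given the largest weight to reflect its greater significance.

```python
# Hypothetical cut-offs and weights; real values would be tuned on labeled data.
LENGTH_CUTOFF = 12
ENTROPY_CUTOFF = 3.5
WEIGHTS = {"long": 1.0, "high_entropy": 2.0, "unresolvable": 3.0}


def dga_points(length: int, entropy: float, resolvable: bool) -> float:
    """Aggregate points from three features of one domain."""
    points = 0.0
    if length > LENGTH_CUTOFF:
        points += WEIGHTS["long"]
    if entropy > ENTROPY_CUTOFF:
        points += WEIGHTS["high_entropy"]
    if not resolvable:
        points += WEIGHTS["unresolvable"]
    return points


def is_dga(points: float, threshold: float = 3.0) -> bool:
    """Flag the domain when its total points exceed the tunable threshold."""
    return points > threshold


if __name__ == "__main__":
    suspicious = dga_points(length=20, entropy=3.9, resolvable=False)
    ordinary = dga_points(length=6, entropy=2.0, resolvable=True)
    print(suspicious, is_dga(suspicious))
    print(ordinary, is_dga(ordinary))
```

With these assumed weights, being unresolvable alone is not enough to cross the threshold; at least one lexical signal must agree, which is one simple way to keep a strong feature from dominating the decision.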

 

Signature- or reverse engineering-based approaches require signature knowledge or access to source code, and obtaining either can take a long time. By contrast, when malware changes, a machine learning model can be retrained as soon as new training samples become available, usually in a much shorter time.

Dealing with False Detections (Positive and Negative)

DGA detections can be wrong, just as weather forecasts can. The root cause of false detections is that the AI / ML method is empirical rather than strictly logical. There is always data that we do not have, that we do not know how to use well, or that is, in some cases, simply an anomaly.

The false detections can be divided into two categories: 

  1. False positives: A false positive happens when we detect a domain as a DGA, but it is not one. This happens more often with long or compound domain names, particularly in foreign languages.
  2. False negatives: A false negative happens when we do not detect a domain as a DGA, but it is one. This failure occurs more often when the characteristics are not strong enough to cross the threshold that the AI / ML model uses to declare a domain a DGA.
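The two categories can be counted directly once each domain has a score and a known label. The sketch below assumes a detector that flags a domain when its score exceeds a threshold; the function name `confusion_counts` and the sample scores and labels are hypothetical.

```python
def confusion_counts(
    scores: list[float], labels: list[bool], threshold: float
) -> dict[str, int]:
    """Count true/false positives/negatives for a 'score > threshold' detector.

    labels[i] is True when domain i really is a DGA.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, actually_dga in zip(scores, labels):
        flagged = score > threshold
        if flagged and actually_dga:
            counts["tp"] += 1
        elif flagged and not actually_dga:
            counts["fp"] += 1  # benign domain wrongly flagged
        elif not flagged and actually_dga:
            counts["fn"] += 1  # DGA domain missed
        else:
            counts["tn"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical scores: the first three domains are DGAs, the last three are benign.
    scores = [6, 5, 3, 1, 4, 2]
    labels = [True, True, True, False, False, False]
    for threshold in (1, 4, 5):
        print(threshold, confusion_counts(scores, labels, threshold))
```

On this sample data, a low threshold catches every DGA but flags benign domains, while a higher threshold removes the false positives at the cost of missed DGAs. Tuning the threshold is a matter of choosing where on that trade-off to sit.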

 

Conclusion

DGA detection is one of the network security problems that is difficult to solve with standard signature-based detection methods. Applying artificial intelligence/machine learning to this problem shows excellent promise, although the technique is still in its infancy. As with weather forecasting, more sensors (machine learning features) need to be developed and more data needs to be collected before we overcome some of the limitations and achieve higher detection precision. But from the evidence gathered up to now, threat detection based on AI / ML is a significant step in the right direction.