Please note: This master’s thesis presentation will take place online.
Navid Malekghaini, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Raouf Boutaba
Deep learning models have shown to achieve high performance in encrypted traffic classification. However, when it comes to production use, multiple factors challenge the performance of these models. The emergence of new protocols, especially at the application-layer, as well as updates to previous protocols affect the patterns in input data, making the model’s previously learned patterns obsolete. Furthermore, proposed model architectures are usually tested on datasets collected in controlled settings, which makes the reported performances unreliable for production use.
In this thesis, we start by studying how the performances of two high-performing state-of-the-art encrypted traffic classifiers change on multiple real-world datasets collected over the course of two years from a major ISP’s network, Orange telecom. We investigate the changes in traffic data patterns highlighting the extent to which these changes, a.k.a. data drift, impact the performance of the two models in service-level and application-level classification. We propose best practices to manually adapt model architectures and improve their accuracy in the face of data drift. We show that our best practices are generalizable to other encryption protocols and different levels of labeling granularity. However, designing efficient model architectures and manual architectural adaptations is time-consuming and requires domain expertise. Neural architecture search (NAS) algorithms have been shown to automatically discover efficient models in other domains, such as image recognition and natural language processing. However, NAS’s application is rather unexplored in Encrypted Traffic Classification.
We propose AutoML4ETC, a tool to automatically design efficient and high-performing neural architectures for Encrypted Traffic Classification, given a target dataset and corresponding features. We define three powerful search spaces tailored specifically for the prominent categories of features in the Encrypted Traffic Classification state-of-the-art, i.e., packet raw bytes, flow time-series, and flow statistics. We show that a simple search strategy over AutoML4ETC’s search spaces can generate model architectures that out-perform the state-of-the-art Encrypted Traffic Classification models on several benchmark datasets, including real-world datasets of TLS and QUIC traffic collected from a major ISP network. In addition to being more accurate, the AutoML4ETC’s architectures are significantly more efficient and lighter in terms of the number of parameters. We further showcase the potential of AutoML4ETC by experimenting with state-of-the-art NAS techniques and model ensembles generated from different search spaces. We also use AutoML4ETC to analyze the state of adoption of the QUIC protocol.