You do what? On the Internet
Can you spy on someone’s online activities? Or is someone spying on you? Your online footprint can reveal more about you than you might think.
In this series, we’ll learn how to build two simple AIs to recognise online activities on popular apps. First, we will build a traffic classifier that uses machine learning to recognise types of inbound/outbound traffic. This is a multi-class classification model for identifying apps: whether someone on your network is using Facebook or Instagram, or reading on medium.com. We will start with 20 popular apps from the App Store, but you can always add new traffic data to recognise different kinds of services. Then, we will build a more sophisticated AI that can recognise traffic at the activity/event level, like chatting on Messenger, booking a room on Airbnb or shopping on Amazon. For simplicity, we will only train 5 unique actions from Facebook, Medium and Spotify. Adding new activities is as simple as adding more labelled traffic from the apps you would like your AI to learn.
We will demonstrate and explain each and every step of building the AI: from data collection/preprocessing, to training/testing the model, to making inferences using our AI. For hosting, we will write a separate post exploring several options, including AWS CloudFront and Google Kubernetes Engine. This guide has been deliberately simplified and aims to be intuitive, especially for people who are new to AI. The series is composed of four posts: (#1) the intuition and some machine learning basics; (#2) getting the dataset and data pre-processing; (#3) training your network AI; and (#4) hosting your AI on the cloud. So, are you ready for it?
Post (1/4)
A Layman’s Introduction to Machine Learning
AI is widely used in many domains these days: image classification, movie recommendation, chatbots, you name it. One of the many reasons is that AI helps us process large amounts of data to find patterns we otherwise couldn’t. And these ‘patterns’ are the keys machines use to differentiate between entities/classes. In traffic classification, the challenge is to model traffic behaviour (using network packets) so we can discriminate packets/flows into their respective classes (like Facebook/Instagram/Medium). Technically, traffic classification can happen at the packet level or the flow level; but for simplicity we’ll use the two interchangeably in this post.
So how can a machine know whether an IP packet belongs to Facebook or Instagram? Does a machine see things the way we do? Let’s consider the classic image classification example: dog vs. cat. For the AI to determine whether an image shows a cat or a dog, it first needs to learn from images labelled as either cat or dog (the training data). In computer vision, each of these images is represented on the machine as a 3D array of (height, width, colour channels). A Convolutional Neural Network (CNN) is normally used to extract the important pixel information from these images, known as feature maps, through a series of convolutions. These feature maps are eventually fed into fully connected layers (hidden layers) that train the AI to tell a dog from a cat based on eye colour, fur type and so on. So, NO, machines do not see data the way we see it; we need to be smart in preparing the data so that machine learning algorithms can pick up discriminative patterns.
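To make this concrete, here is a minimal sketch (not this series’ actual model) of what a tiny CNN for a hypothetical cat/dog classifier could look like in tensorflow.keras; the layer sizes and the 128x128 input shape are arbitrary, illustrative choices.

```python
# A minimal, illustrative CNN sketch (not this series' actual model).
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # convolution layers learn the 'feature maps' mentioned above
    layers.Conv2D(16, 3, activation="relu", input_shape=(128, 128, 3)),  # (height, width, colour channels)
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    # fully connected (hidden) layers make the actual cat-vs-dog decision
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # e.g. probability of 'dog'
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```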
Meanwhile, in traffic classification, we are dealing with text data most of the time. Some common data sources are: (a) network logs from firewalls/IPS/IDS, and (b) network traffic from a packet sniffer like Wireshark. Text-based data is easier to manipulate, and it is also less compute-intensive during the train/test process. However, there aren’t many resources out there; that means fewer GitHub repos and open-source tools for pre-processing network traffic, and fewer pre-trained models for transfer learning (we will get into this later).
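As a small taste of what post #2 covers in detail, here is a minimal sketch of turning a packet capture into tabular, text-based data using the scapy library; the file name capture.pcap and the chosen columns are hypothetical.

```python
# A minimal sketch: turning a Wireshark capture into a table with scapy;
# 'capture.pcap' and the chosen columns are hypothetical.
from scapy.all import rdpcap   # pip install scapy
import pandas as pd

packets = rdpcap("capture.pcap")   # hypothetical capture exported from Wireshark
rows = []
for pkt in packets:
    rows.append({
        "time": float(pkt.time),           # capture timestamp
        "length": len(pkt),                # packet size in bytes
        "top_layer": pkt.lastlayer().name  # highest protocol layer scapy decoded
    })

df = pd.DataFrame(rows)
print(df.head())
```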
The Intuition
Now, try saying ‘hey Siri, good morning’ and then ‘ok Google, good morning’. On an iPhone, Siri will just greet you back; Google, meanwhile, will show you some essential information to kick-start your morning (appointments, weather, etc.). Some AIs simply understand more context, support context carry-over, and are better integrated. By the same logic, some traffic classifiers can distinguish more types of traffic, some are more accurate, and some are a bit of both. So, what makes one AI smarter than another?
A data scientist would tell you that there are several components that collectively influence the accuracy of the resulting model. We will break each of them down and explain how to deal with each stage carefully.
Dataset — the saying ‘rubbish in, rubbish out’ applies here. A good AI can only be modelled from a good-quality dataset. So what does quality mean here? My interpretation is that (1) you need a dataset large enough for the AI to capture meaningful patterns; e.g., your AI probably can’t tell a cat from a bird if you trained it only on a cat/dog dataset. (2) Your dataset must have enough good samples to give you richer feature vectors; e.g., if you only capture packets while the user is idle on Facebook, then your AI probably can’t recognise the Facebook traffic generated by scrolling the newsfeed. (3) You should have attempted to clean some of the noise from the dataset; e.g., if you capture Facebook packets in Wireshark and background control protocols like SSDP/discovery packets end up in the training data, then you are teaching your AI to recognise them as Facebook traffic. And (4) the dataset should have balanced samples across the classes; e.g., if you train a binary Facebook/Instagram classifier with 1000 FB packets and 50 IG packets, then the model will most likely be biased towards Facebook (a quick balance check is sketched below).
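Here is a minimal sketch of point (4): checking class balance and down-sampling the majority class with pandas and scikit-learn. The file packets.csv and the label names are hypothetical.

```python
# A minimal sketch: check class balance and down-sample the majority class;
# 'packets.csv' and the label names are hypothetical.
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("packets.csv")         # hypothetical labelled dataset
print(df["label"].value_counts())       # e.g. facebook: 1000, instagram: 50

majority = df[df["label"] == "facebook"]
minority = df[df["label"] == "instagram"]
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])
print(balanced["label"].value_counts())  # now 50 / 50
```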
Features — here's an interesting one. Have you ever wondered why KFC’s fried chicken tastes better than McDonald’s? The secret lies in, well, their secret recipe. KFC might use more herbs, or more of the right herbs, or a combination of the right herbs in the right ratio compared to McDonald’s. That golden ratio, or golden recipe, matters for an accurate AI too. In data science, it translates to deciding what information we should keep in our dataset for model training.
For our example, consider that we have a dataset of 1000 packets each for Facebook and Medium. Our intuition is that since every app is built for a different purpose, each will invoke different user and system behaviours on network resources. From that, we hypothesise that the packets generated by Facebook could be larger (around 1500 bytes on average) compared to Medium (around 600 bytes on average). If this is true (well, the machine will figure it out), then we can filter the ‘packet size’ field from the 2000 training samples for training (see post #2 on data pre-processing). The process of identifying which feature(s) to keep, and which to discard, is called ‘feature selection’. The process of exporting/filtering the chosen features is called ‘feature extraction’. Now, if we were right and packet size is a good discriminator, then the AI will be built around figuring out the class of a new packet based on its size. For example, if an unknown packet of 568 bytes arrives, the AI will classify it as a packet from Medium. Figure 1 visualises this intuition (note that packet size fluctuates, so it’s good enough for the averages to be clearly distinct).
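A minimal sketch of this single-feature intuition, with made-up packet sizes; a decision tree stands in for ‘the AI’ here, and the 568-byte packet from the example above comes out as Medium.

```python
# A minimal sketch of the single-feature intuition, with made-up packet sizes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# hypothetical packet sizes (bytes): Facebook ~1500, Medium ~600
sizes = np.array([[1480], [1510], [1495], [590], [615], [605]])
labels = ["facebook", "facebook", "facebook", "medium", "medium", "medium"]

clf = DecisionTreeClassifier().fit(sizes, labels)
print(clf.predict([[568]]))  # -> ['medium'], matching the example above
```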
Remember that this is just a naive example. Now consider a multi-class model for FB, Medium and IG classification. Can we just add new training data (traffic from IG) and retrain the model? Yes, if we are lucky; but our odds are slim here. Suppose the average packet size of IG is 1500 bytes and the average FB packet size is also 1500 bytes; then the AI will have a hard time distinguishing FB from IG (lots of misclassifications here). Fortunately, this is why data scientists normally use multiple features (a feature set) to train a model. In our case, perhaps adding ‘packet arrival time (pat)’ as feature #2 can help. For example, FB servers might be hosted in regional sites while IG content is served from a CDN; that means IG packets would generally have a lower ‘pat’ due to lower client-server latency (see the sketch below). So, just like KFC finding the right fried chicken recipe, part of a data scientist’s job is finding the right features (using domain knowledge) to train a better AI.
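Continuing the sketch with made-up numbers: packet size alone cannot separate FB from IG, but adding arrival time as a second feature can. The values and the choice of a random forest are illustrative, not a claim about real FB/IG traffic.

```python
# A minimal sketch: two features instead of one; all numbers are made up.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# columns: [packet_size_bytes, packet_arrival_time_ms]
X = np.array([[1500, 40], [1490, 45],   # Facebook: large packets, higher latency
              [1505, 11], [1510, 10],   # Instagram: large packets, lower latency (CDN)
              [600, 30],  [610, 28]])   # Medium: smaller packets
y = ["facebook", "facebook", "instagram", "instagram", "medium", "medium"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[1500, 11]]))  # likely 'instagram', thanks to the low latency
```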
p.s. it is fun finding the golden recipe; but in modern AI much of this effort can be automated using algorithms like Correlation-based Feature Selection (CFS) and Information Gain (IG). In deep learning, much of this feature selection is abstracted away, since the neural network (working like a black box) figures it out through feed-forward and backward propagation.
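For instance, scikit-learn’s mutual information scorer (an information-gain-style measure) can rank features automatically; the toy feature matrix below is hypothetical.

```python
# A minimal sketch: rank features by mutual information instead of hand-picking.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# hypothetical features: [packet_size, arrival_time_ms, ttl]
X = np.array([[1500, 40, 64], [1490, 45, 64],
              [1505, 11, 64], [1510, 10, 64],
              [600, 30, 64],  [610, 28, 64]])
y = ["facebook", "facebook", "instagram", "instagram", "medium", "medium"]

selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print(selector.scores_)        # the constant 'ttl' column should score near 0
print(selector.get_support())  # which 2 features were kept
```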
Machine-learning Algorithms — it wouldn’t be ML if the choice of ML algorithm didn’t play a part in building an accurate AI. Like KFC finding the right recipe, they also need the right way to fry the chicken coated in that recipe. What’s the right temperature? How long should the chicken stay in the fryer?
In the context of ML, there are no standard rules for choosing the right classifier; in fact, each type of problem tends to have its own optimal classifier. So, it is mostly through trial and error that we find the suitable ones. Data scientists might say ‘guess I got lucky’ when they hit the jackpot, but more often it comes from their experience building many types of AI with different kinds of datasets. For example, linear regression is suitable for modelling continuous outputs, such as predicting real-estate prices, but it falls flat if used for cat/dog classification. Here are a few tips for beginners. The first rule is to be sure whether you are building a regression model or a classification model. Second, always check whether you are building a binary or a multi-class model; for example, a decision tree handles multi-class problems natively, whereas plain logistic regression is binary by default (it needs a one-vs-rest or softmax extension).
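A minimal sketch of that trial and error: compare a few candidate classifiers with cross-validation on a synthetic dataset; both the dataset and the shortlist are arbitrary.

```python
# A minimal sketch: compare candidate classifiers with 5-fold cross-validation
# on a synthetic multi-class dataset (purely illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

for clf in [DecisionTreeClassifier(random_state=0),
            LogisticRegression(max_iter=1000),
            RandomForestClassifier(random_state=0)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))
```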
Ok, if you are feeling ‘there’s got to be more to this, this sounds way too easy, something must not be right?’, your gut feeling is correct. More often than not, data scientists go deeper into the classifiers and tune them with optimal parameters/configs to squeeze more performance out of the algorithms. This is called hyper-parameter tuning. To really take advantage of it, we need to read the corresponding documentation (highlight the syntax and press ctrl+i in Spyder) to understand which options we can tune. For example, the number of trees in a random forest can be increased to improve accuracy (up to a point of diminishing returns). A friendly piece of advice: don’t over-tune; in some cases the default parameters are already good enough, and it can be detrimental if we overdo it or do it wrongly.
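A minimal sketch of hyper-parameter tuning with GridSearchCV, again on synthetic data; here we only sweep the number of trees in a random forest.

```python
# A minimal sketch: sweep the number of trees in a random forest with
# GridSearchCV (synthetic data, purely illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [10, 50, 100, 200]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```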
Now, if AI is so smart, why isn’t anyone using AI to build AI? In cases where the dataset is too big and training takes too long, trial and error becomes less feasible. A faster way to find the best-fit algorithm is to use ensemble techniques, which combine several well-performing algorithms into one (more commonly known as stacking or bagging), so that we have a composite classifier whose members cover each other’s ground. And if you have ever thought of automating the entire pipeline, it is worth checking out the increasingly popular deep learning techniques, which take most of the heavy lifting out of feature engineering (and even the coding is greatly simplified, IMO).
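A minimal sketch of stacking with scikit-learn’s StackingClassifier, on synthetic data; the choice of base learners and final estimator is arbitrary.

```python
# A minimal sketch: combine several base classifiers into one composite model
# with StackingClassifier (synthetic data, purely illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("forest", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
print(round(cross_val_score(stack, X, y, cv=5).mean(), 3))
```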
So far, we have looked into the components of building an accurate AI and how a machine differentiates traffic classes. So, if you are ready, click on the link to post #2 and LET’S GET PHYSICAL.
developing …