Welcome to the fifth entry in the Big Data series. I hope you subscribe, interact, and resteem. Thanks, @javf1016 :)
First Entry: https://steemit.com/business/@javf1016/big-data-1-10
Second Entry: https://steemit.com/education/@javf1016/big-data-2-10
Third Entry: https://steemit.com/education/@javf1016/big-data-3-10
Fourth Entry: https://steemit.com/education/@javf1016/big-data-4-10
3.1 Supervised learning (Decision trees)
Supervised learning is built on a labeled dataset, which in this case is the security information that the Big Data platform has compiled from logs, historical records, etc.
Supervised learning, itself a branch of machine learning, is in turn divided into:
Regression
Classification
Regression-type supervised learning makes predictions on a continuous output, for example numerical values such as the continuous tracking of an IP address or of the ports used at different times of the day. Classification, unlike regression, makes predictions of a discrete type, for example categories or labels such as the applications used or the entry to a machine within the administration.
Consider a hypothetical system focused on detecting when a user account has been compromised and is being used to intrude into the corporation's systems. There is a client, which represents the access terminal to the systems, and a server, which is the agent backed by the corporation's Big Data. For each action, the client sends a vector of N fields that is stored in the Big Data. After interpreting it, the agent proceeds with the prediction, which in this case results in a block, a request for verification factors, or just a log entry recording the anomaly. The first thing we are going to analyze is whether the input is really considered an anomalous intrusion into the system.
Vector input {String User, String Key, String Geolocation, Boolean Anomaly, String System}
We are going to build a function from a set of labeled input vectors. The fourth field of each vector is a Boolean class that indicates whether the input is considered an anomaly, and therefore whether the input is blocked or not.
Vector = {Admin, Admin, Bogota, true, Database}
Vector = {Admin, Admin, Cali, false, Database}
A decision tree is then generated. Its leaves, nodes, and depth depend on how the algorithm is implemented, but in essence the result is the same: a map that guides the decision (Fig. 8).
Fig. 8. Example decision tree for vectors
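To illustrate how an algorithm might decide which field to place at the top of the tree, here is a minimal sketch that scores each field by weighted Gini impurity. This is only one common criterion, not necessarily the one behind Fig. 8. The names FIELDS, gini and best_split_field are hypothetical, and the Anomaly label is kept in a separate list instead of inside the vector.

```python
from collections import Counter, defaultdict

# Field names taken from the input vector, without the Anomaly label.
FIELDS = ("user", "key", "geolocation", "system")


def gini(labels):
    """Gini impurity of a list of labels (0.0 means all labels agree)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_field(rows, labels):
    """Return (field, score) for the field with the lowest weighted impurity."""
    best_field, best_score = None, float("inf")
    for idx, field in enumerate(FIELDS):
        # Group the labels by the value each row has in this field.
        groups = defaultdict(list)
        for row, label in zip(rows, labels):
            groups[row[idx]].append(label)
        score = sum(len(g) / len(labels) * gini(g) for g in groups.values())
        if score < best_score:
            best_field, best_score = field, score
    return best_field, best_score


# The two labeled vectors from the example above.
rows = [
    ("Admin", "Admin", "Bogota", "Database"),
    ("Admin", "Admin", "Cali", "Database"),
]
labels = [True, False]  # Anomaly field

print(best_split_field(rows, labels))
```

With only these two vectors, geolocation is the only field that separates the anomalous input from the normal one, so it gets the best score.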
Once the model is accepted, in this case a decision tree, the agent only has to traverse the tree using the vector's values until it reaches a leaf. For example, for the first vector (an Admin user with the Admin key who enters the databases from Bogota), entry is denied because the Anomaly field is true. Something to emphasize in this model is its response speed ("interpretation and prediction"). In this case the problem has complexity V = {5}, which can become V = {N}, where N is the order of traversal within the tree. That suits systems that require an agile response under a high query rate.
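The traversal can be sketched as follows. The tree here is written by hand purely for illustration, it is not the tree from Fig. 8, and the names AccessVector, TREE and is_anomaly are hypothetical. Treating values the tree has never seen as anomalous is an assumption made for this sketch.

```python
from typing import NamedTuple


class AccessVector(NamedTuple):
    """Input vector without the Anomaly label, which is what we predict."""
    user: str
    key: str
    geolocation: str
    system: str


# Hypothetical hand-built tree: inner nodes are dicts, leaves are booleans
# (True means the input is considered an anomaly and is blocked).
TREE = {
    "field": "user",
    "branches": {
        "Admin": {
            "field": "geolocation",
            "branches": {"Bogota": True, "Cali": False},
        },
    },
}


def is_anomaly(tree, vector, default=True):
    """Walk the tree with the vector's values until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = getattr(vector, node["field"])
        if value not in node["branches"]:
            # Assumption: unseen values are treated as anomalous.
            return default
        node = node["branches"][value]
    return node


first = AccessVector("Admin", "Admin", "Bogota", "Database")
second = AccessVector("Admin", "Admin", "Cali", "Database")
print(is_anomaly(TREE, first), is_anomaly(TREE, second))
```

Each prediction costs at most one lookup per level of the tree, which is why the response stays fast even under many queries.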
3.2 Unsupervised learning (Maps, Networks)
Unsupervised learning handles circumstances in which little or nothing is known about the data. It is divided into two groups, focused on data grouping and associative memory:
Clustering: groups a collection of data by values, genres, etc.
Association: groups a collection of data by historical activities or historical behaviors.
Although it may seem complicated to implement at first, we will use a model called a self-organizing map (SOM), a model inspired by biology, specifically by the neurons involved in the sense of touch.
Hypothetically, our network sniffer records the traffic and passes it on as an input vector with the following values (frame buffer, TCP flag, IP), which is sent to the model.
A model is created again, but in this case the data is not labeled: it is not known whether something is good or bad.
IP = {192.168.1.2, 11.11.11.10}
TCP = {0.0.1.1.0.0.0.0.0, 0.1.0.0.0.0.0.0.0}
The model is treated as a mesh whose nodes, through learning, fit themselves to the vectors, causing the density of the mesh to increase around them (Fig. 9).
Fig. 9 Example of a network for vectors
This procedure is applied to each input vector until the mesh settles. Each vector affects the mesh and becomes associated with a concentration of nodes called a cluster, as a function of distance and density with respect to the point where the vector falls.
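The fitting procedure can be sketched with a very small SOM written in plain Python. This is a simplified illustration under several assumptions: the mesh is a one-dimensional chain of nodes instead of a 2D grid, only the TCP flag vectors are used as input, and the names and parameter values (train_som, n_nodes, epochs, lr0, radius0) are hypothetical choices.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Index of the node whose weights are closest to the input vector."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], vector))


def train_som(vectors, n_nodes=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Fit a 1D chain of nodes to the unlabeled input vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        # Learning rate and neighborhood radius shrink as training advances.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-9)
        for v in vectors:
            bmu = best_matching_unit(weights, v)
            for i, w in enumerate(weights):
                # Nodes near the winner on the mesh move more than far ones.
                grid_dist = abs(i - bmu)
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * influence * (v[d] - w[d])
    return weights


# TCP flag vectors from the example above, one bit per position.
TCP_FLAGS = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
]

mesh = train_som(TCP_FLAGS)
for flags in TCP_FLAGS:
    print(flags, "-> node", best_matching_unit(mesh, flags))
```

After training, new traffic can be mapped to its closest node in the same way, and inputs that land far from every node are candidates for review.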
Glossary
- String: basic data type that represents a string of characters.
- Boolean: basic data type that has two possible values, True or False.