This year’s Spark + AI Summit featured many sessions on security and threat detection. A few things at the conference caught my eye from a general best-practices standpoint. Below are some key takeaways, along with thoughts on a new approach to feature extraction for threat detection modeling.
Model Exploitation and Poisoning
As we become increasingly reliant on models, we might also be opening ourselves up to new forms of exploitation that were not previously possible. Professor Dawn Song (UC Berkeley) described a few of these new attack patterns. For example, an attacker can attempt to compromise the integrity of a machine learning (ML) process by causing the learning system to produce either incorrect results, or an outcome designed by the attacker.
Causing a complex model to produce incorrect results could be achieved by perturbing the model input data in a systematic way, so as to produce model outputs that a human or other basic evaluation system would never produce. An example given was that a traffic stop sign with small but specific, graffiti-like alterations (which, to humans, would seem minor and not at all significantly different from other stop signs) can consistently cause an artificial visual recognition system of, say, an autonomous vehicle to read “45MPH speed limit” and increase speed instead of stopping.
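To make the idea of a systematic perturbation concrete, here is a minimal sketch on a toy linear classifier. It is not the image-recognition attack from the talk: the weights, features, and labels are made up for illustration, and the perturbation rule is a simplified version of the well-known fast gradient sign idea (nudge every input feature by a small, bounded amount in the direction that hurts the correct class most).

```python
# Toy illustration only: a hypothetical 4-feature linear "sign classifier".
# All numbers below are invented for the example.

def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def score(weights, bias, features):
    """Linear score; a positive score means the model says 'stop'."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict(weights, bias, features):
    return "stop" if score(weights, bias, features) > 0 else "speed_limit"

def perturb(weights, features, epsilon):
    """Shift each feature by at most epsilon in the direction that lowers
    the 'stop' score. For a linear model, the gradient of the score with
    respect to the input is simply the weight vector."""
    return [x - epsilon * sign(w) for w, x in zip(weights, features)]

WEIGHTS = [0.9, -0.6, 0.4, -0.8]
BIAS = -0.1
CLEAN = [0.5, 0.2, 0.3, 0.1]

ADVERSARIAL = perturb(WEIGHTS, CLEAN, epsilon=0.15)

print(predict(WEIGHTS, BIAS, CLEAN))
print(predict(WEIGHTS, BIAS, ADVERSARIAL))
```

No single feature moves by more than 0.15, yet the prediction flips, because every small change is aligned against the model's decision. Real attacks on deep vision models follow the same logic, with the gradient computed through the network rather than read off a weight vector.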
Causing a prediction model to produce outcomes designed by the attacker, on the other hand, can be achieved through “poisoning attacks,” where falsified training data is injected. Crowdsourced data is especially susceptible to poisoning; Microsoft’s Tay Twitter bot may come to mind. Another session alluded to this idea as it relates to fraud modeling, explaining that in the fraud space, where trends change often and frequent retraining is desired, training on a small, recent sample could have an unintended effect. For example, fraudsters could gradually inject activity that carries the characteristics of an eventual attack but never amounts to a breach, so that it is treated as legitimate when the model is retrained. The actual attack would come only after retraining.
Models in the Wild
Another ML topic that raised security concerns is the rise of models that execute remotely on the client/device side. Rohan Kumar, Microsoft’s Corporate Vice President of Azure Data, described the Intelligent Edge – a cloud-based development framework where model deployment can be done seamlessly on either the cloud or devices.
The example application was visual object recognition for drones that may not have signal, or where bandwidth is limited for transmitting large amounts of visual information. In these cases, recognition models are embedded, and only high-level information is transmitted. The concern this raises is that a model could fall into the wrong hands and be repeatedly tested for unintended behavior (such as an incorrect response to an altered traffic sign), or reverse-engineered for other purposes.
Graph Embeddings as a New Security Tool
Dominique Brezinski, a security leader at Apple, presented an interesting new approach to feature extraction for threat and fraud model development. Brezinski described network graphs as “the peanut butter in a PB&J security stack.” At least three other talks focused either on graph feature extraction or on graphs as a vital security tool. The most relevant of these to money movement threat detection came from Venkatesh Ramanathan of PayPal, who researches new methods of payment fraud detection. In particular, his efforts are focused on preventing collusion fraud, where complex buyer/seller interactions constitute collusive activity and result in significant monetary losses for the payment processor, such as PayPal.
Transactions flowing through a system can be represented as relationships in a graph database, where, for example, the buyer and seller are vertices and their various interactions are edges. Once the data is in that form, however, the complexity of financial interactions does not lend itself easily to the row/column-based data needed for model training. The developer needs to abstract the complexity and produce aggregate features from these interactions that can be used in ML models. Some obvious network patterns can be represented fairly easily, such as the number of interactions or the concentration of buyers per seller. But to go deeper into the richness of financial networks, an automated approach is needed. One interesting approach used by PayPal is creating graph embeddings with node2vec. This algorithm builds on the word2vec analogy and was developed by Aditya Grover and Jure Leskovec at Stanford University.
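As a rough sketch of the "easy" end of that spectrum, the hand-built aggregates mentioned above can be computed directly from an edge list. The buyer/seller IDs and feature names below are hypothetical, and this is not PayPal's actual pipeline, just one plausible way to flatten edges into one row per seller.

```python
from collections import Counter, defaultdict

def seller_features(transactions):
    """Aggregate (buyer, seller) edges into one feature row per seller.

    transactions: iterable of (buyer_id, seller_id) tuples, one per
    interaction. Feature names are made up for this example.
    """
    buyers_per_seller = defaultdict(Counter)
    for buyer, seller in transactions:
        buyers_per_seller[seller][buyer] += 1

    features = {}
    for seller, per_buyer in buyers_per_seller.items():
        total = sum(per_buyer.values())
        features[seller] = {
            # How many interactions touched this seller.
            "n_interactions": total,
            # How many distinct buyers they came from.
            "n_buyers": len(per_buyer),
            # Concentration: share of interactions from the busiest buyer.
            "top_buyer_share": max(per_buyer.values()) / total,
        }
    return features

# Hypothetical edge list: b1 buys from s1 three times, and so on.
TRANSACTIONS = [
    ("b1", "s1"), ("b1", "s1"), ("b1", "s1"), ("b2", "s1"),
    ("b2", "s2"), ("b3", "s2"),
]

FEATURES = seller_features(TRANSACTIONS)
for seller, row in sorted(FEATURES.items()):
    print(seller, row)
```

Features like these only capture a vertex's immediate neighborhood. Anything involving longer chains of interactions quickly becomes tedious to hand-code, which is the gap embeddings aim to fill.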
Much like word2vec, node2vec learns embeddings using the skip-gram model. The main difference is in how the corpus is generated and what counts as a sentence for a graph. A sentence is a linear sequence of words, which makes it directed, acyclic, and unweighted, whereas a graph can have some or none of these properties. A node’s neighborhood is therefore sampled through random walks, a strategy shared by several other graph algorithms, such as PageRank. The length and number of walks, and the likelihood of wandering away from the root node, are hyper-parameters (in addition to the standard skip-gram model parameters). More specifically, wandering is controlled by two parameters, the return parameter p and the in-out parameter q, which let the walk interpolate between breadth-first sampling (BFS), which stays close to the root, and depth-first sampling (DFS), which moves away from it. So, as the neighborhood of a vertex is sampled, the unnormalized probability of stepping from one vertex to the next is the edge weight multiplied by a bias factor alpha, where alpha depends on p and q.
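The walk itself is easier to see in code. The sketch below follows the bias rule from the node2vec paper, where a return parameter p and an in-out parameter q set alpha: 1/p for stepping back to the previous vertex, 1 for moving to a vertex that is also a neighbor of the previous one, and 1/q for moving farther away. The tiny graph, the function names, and the dict-of-dicts representation are my own choices for illustration; production implementations precompute transition tables rather than recomputing weights at every step.

```python
import random

def alpha(prev, nxt, graph, p, q):
    """Search bias for stepping to nxt, given the walk just came from prev."""
    if nxt == prev:
        return 1.0 / p          # return to where we came from
    if nxt in graph[prev]:
        return 1.0              # stay at distance 1 from prev (BFS-like)
    return 1.0 / q              # move outward, away from prev (DFS-like)

def biased_walk(graph, start, length, p, q, rng):
    """Sample one second-order random walk of up to `length` vertices.

    graph: dict mapping vertex -> {neighbor: edge_weight}.
    """
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step has no previous vertex: use edge weights only.
            weights = [graph[current][n] for n in neighbors]
        else:
            prev = walk[-2]
            weights = [graph[current][n] * alpha(prev, n, graph, p, q)
                       for n in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Hypothetical undirected, weighted graph.
GRAPH = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0, "d": 2.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"b": 2.0},
}

RNG = random.Random(0)
WALKS = [biased_walk(GRAPH, node, length=5, p=2.0, q=0.5, rng=RNG)
         for node in sorted(GRAPH) for _ in range(3)]
for walk in WALKS:
    print(" ".join(walk))
```

Each printed walk plays the role of a sentence. Feeding the full list of walks to any skip-gram word2vec implementation would then yield one embedding vector per vertex. With q below 1, as here, the walks favor moving outward. Raising q keeps them closer to the start vertex.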
One large and relatively obvious caveat in automating graph feature extraction is that any such strategy depends completely on the graph schema containing the necessary domain information, and being structured so as to give the extraction strategy a fighting chance of being informative. As intuitive as graphs can be, there are still many choices to make to represent relationships. Automatically optimizing the graph structure, such as reassigning edge weights or deciding which edges to include, is currently an intractable problem. Node2vec, however, appears to be a very interesting way to generalize many structural properties of graphs, which can be powerful indicators in fraud detection.