In the increasingly interconnected digital landscape, databases are the backbone of countless applications, storing everything from financial records to personal information. Protecting this data is paramount, especially when considering the potential consequences of unauthorized access. One of the more subtle threats comes from misuse or exploitation of service accounts—privileged accounts used by applications or scripts to interact with databases. These accounts often have significant access and can become a prime target for malicious actors. In this post, I’ll explore how to detect potential fraudulent access through these accounts using a Markov model for anomaly detection.

The Challenge: Securing Service Accounts

Service accounts are designed for automation and typically perform repetitive and predictable actions. For example, a service account might regularly read data from specific tables, insert records into another set of tables, and update others. This repetitive nature is both a strength and a vulnerability. While it simplifies automation, it also creates a predictable pattern that, if deviated from, could indicate unauthorized access or fraudulent activity.

However, detecting these deviations isn’t straightforward. Logs generated by service accounts can be massive, and manually sifting through them to spot irregularities is impractical. Moreover, legitimate changes in application behavior can cause the service account to access new tables or perform different actions, complicating the detection of true anomalies. This is where statistical models like the Markov model come into play, providing a way to model normal behavior and flag deviations as potential anomalies.

What Is a Markov Model?

A Markov model is a mathematical model that represents a system where the probability of moving from one state to another depends only on the current state, not on the sequence of events that preceded it. This characteristic, known as the Markov property, makes it particularly useful for modeling time series data and sequential processes—like the sequence of database actions performed by a service account.

Key Components of a Markov Model:

  • States: In our context, a state can be defined by the combination of an action (e.g., SELECT, INSERT, UPDATE) and the specific database table it acts upon (e.g., Table_A, Table_B).
  • Transitions: These represent the movement from one state to another. For example, a transition might involve moving from a SELECT action on Table_A to an INSERT action on Table_B.
  • Transition Probabilities: These probabilities capture the likelihood of moving from one state to the next, estimated from historical data (see the formula just below this list). They form the core of the Markov model, enabling us to quantify what “normal” behavior looks like.
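
Concretely, the transition probabilities are estimated from the historical audit log as a simple ratio of counts (the maximum-likelihood estimate). For a current state i and a next state j:

P(next = j | current = i) = count(i → j) / Σ_k count(i → k)

where count(i → j) is the number of times the log moves directly from state i to state j. This is exactly the calculation performed in Step 2 below.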

Step-by-Step Implementation

Step 1: Simulating Database Audit Logs

To demonstrate how a Markov model can be used for anomaly detection, I first created a sample dataset simulating the actions of a service account over time. The dataset includes typical database operations such as SELECT, INSERT, UPDATE, and DELETE, performed on various tables.

import numpy as np
import pandas as pd
from collections import defaultdict

# Sample Data Creation
data = {
    "timestamp": pd.date_range("2024-08-01", periods=20, freq="H"),
    "action": [
        "SELECT", "SELECT", "INSERT", "SELECT", "SELECT", 
        "UPDATE", "SELECT", "SELECT", "INSERT", "SELECT",
        "SELECT", "DELETE", "SELECT", "UPDATE", "SELECT", 
        "SELECT", "INSERT", "SELECT", "UPDATE", "SELECT"
    ],
    "table": [
        "Table_A", "Table_A", "Table_B", "Table_A", "Table_A",
        "Table_C", "Table_A", "Table_A", "Table_B", "Table_A",
        "Table_A", "Table_B", "Table_A", "Table_C", "Table_A", 
        "Table_A", "Table_B", "Table_A", "Table_C", "Table_A"
    ]
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Display the sample data
df.head(10)

This initial dataset represents the normal behavior of a service account. It simulates the expected database operations over a period of time, capturing how the account interacts with different tables.

Step 2: Building the Markov Model

With the dataset in place, the next step is to construct the Markov model by analyzing the transition probabilities between different states. Here, each state is defined by a tuple combining the action and the table being accessed.

# Building the Markov Model

# Define states as tuples of (action, table)
states = list(zip(df["action"], df["table"]))

# Initialize the transition matrix as a defaultdict
transition_matrix = defaultdict(lambda: defaultdict(int))

# Populate the transition matrix based on the sample data
for (current_state, next_state) in zip(states[:-1], states[1:]):
    transition_matrix[current_state][next_state] += 1

# Convert the transition counts to probabilities
transition_probs = defaultdict(dict)
for current_state, transitions in transition_matrix.items():
    total_transitions = sum(transitions.values())
    for next_state, count in transitions.items():
        transition_probs[current_state][next_state] = count / total_transitions

# Display the transition probabilities
transition_probs

In this model:

  • States: Each state is a combination of an action and a table, such as (SELECT, Table_A).
  • Transitions: The model tracks how often the service account transitions from one state to another, such as moving from reading Table_A to inserting into Table_B.
  • Probabilities: These are calculated by dividing the number of times a particular transition occurs by the total number of transitions from the current state. This probability distribution forms the foundation of our anomaly detection system.

The resulting transition probabilities provide a detailed map of what constitutes normal behavior for the service account. For example, a common pattern might involve reading from Table_A multiple times before updating Table_C. These patterns are captured in the model and will serve as the benchmark against which new sequences are measured.
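
Because transition_probs is a nested dictionary keyed by state tuples, it can be awkward to read directly. As an optional aid (not needed for the detection logic itself), the short sketch below reshapes it into a pandas DataFrame so the probabilities can be inspected as a matrix:

# Optional: view the transition probabilities as a matrix.
# Rows are current states, columns are next states; unobserved transitions show as 0.
prob_matrix = pd.DataFrame(transition_probs).T.fillna(0)
prob_matrix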

Step 3: Simulating Anomalous Behavior

With the Markov model built, the next step is to evaluate new sequences of actions and determine how likely they are under the model. To do this, we simulate a new set of actions, some of which represent normal behavior, while others might indicate potential fraud.

# Simulating New Data for Anomaly Detection
new_data = {
    "timestamp": pd.date_range("2024-08-02", periods=5, freq="H"),
    "action": ["SELECT", "INSERT", "DELETE", "SELECT", "UPDATE"],
    "table": ["Table_A", "Table_B", "Table_B", "Table_A", "Table_C"]
}

# Create a DataFrame from the new data
new_df = pd.DataFrame(new_data)

# Define the states for the new data
new_states = list(zip(new_df["action"], new_df["table"]))

# Calculate the likelihood of the new sequence
def calculate_sequence_probability(states, transition_probs, unseen_prob=0.01):
    probability = 1.0
    for current_state, next_state in zip(states[:-1], states[1:]):
        # Look up the transition without mutating the defaultdict;
        # transitions never observed in training receive a small floor probability
        probability *= transition_probs.get(current_state, {}).get(next_state, unseen_prob)
    return probability

# Calculate the probability of the new sequence
sequence_probability = calculate_sequence_probability(new_states, transition_probs)

sequence_probability

In this example, we simulate a new sequence that includes a mix of common actions (e.g., SELECT on Table_A) and less common ones (e.g., DELETE on Table_B). By feeding this sequence into the Markov model, we can calculate its overall probability.

The probability is computed by multiplying the transition probabilities along the sequence. If the model encounters a transition it has never seen before, it assigns a very low probability to that transition, significantly lowering the overall probability of the sequence. A very low probability indicates that the sequence is unlikely based on historical data and may warrant further investigation.
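
One practical caveat: multiplying many probabilities together can underflow to zero for long sequences. A common refinement, sketched below with the same 0.01 floor for unseen transitions, is to sum log-probabilities instead; the function name is my own and is not part of the notebook above.

import math

def calculate_sequence_log_probability(states, transition_probs, unseen_prob=0.01):
    # Summing log-probabilities avoids numerical underflow on long sequences
    log_probability = 0.0
    for current_state, next_state in zip(states[:-1], states[1:]):
        prob = transition_probs.get(current_state, {}).get(next_state, unseen_prob)
        log_probability += math.log(prob)
    return log_probability

# Log-probability of the simulated new sequence
calculate_sequence_log_probability(new_states, transition_probs)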

Step 4: Setting a Threshold for Anomaly Detection

A crucial aspect of anomaly detection is setting a threshold that defines what constitutes an anomaly. This threshold can be determined by analyzing the distribution of probabilities from historical sequences.

# Calculate the probabilities of historical sequences
historical_probabilities = []
for i in range(len(states) - 2):
    sub_sequence = states[i:i + 3]
    probability = calculate_sequence_probability(sub_sequence, transition_probs)
    historical_probabilities.append(probability)

# Convert to a pandas Series for easier analysis
historical_prob_series = pd.Series(historical_probabilities)

# Display basic statistics and a threshold at the 5th percentile
threshold = historical_prob_series.quantile(0.05)
historical_prob_series.describe(), threshold

By calculating the probabilities of sequences in the historical data, we can set a threshold at a particular percentile (e.g., the 5th percentile). This threshold represents the lower bound of what is considered normal behavior. Any sequence with a probability below this threshold would be flagged as anomalous.

The choice of percentile is crucial and depends on the acceptable level of risk. A higher percentile (e.g., the 10th) raises the cutoff, so more sequences fall below it and get flagged: the system becomes more sensitive but also more prone to false positives. Conversely, a lower percentile (e.g., the 1st) lowers the cutoff, reducing false positives but potentially missing some true anomalies.
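
To make this trade-off concrete, candidate cutoffs can be compared side by side with a couple of lines; the sketch below simply reuses historical_prob_series from the previous step.

# Compare candidate thresholds at several percentiles of the historical probabilities
for pct in [0.01, 0.05, 0.10]:
    label = int(round(pct * 100))
    print(f"p{label:02d} threshold: {historical_prob_series.quantile(pct):.6f}")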

Step 5: Applying the Threshold to Detect Anomalies

Finally, we apply the threshold to the new sequence of actions. If the calculated probability of the sequence is below the threshold, it is flagged as an anomaly.

# Check if the sequence probability is below the threshold
is_anomaly = sequence_probability < threshold
is_anomaly

In this case, the new sequence is indeed flagged as anomalous, indicating that it deviates significantly from the normal behavior captured by the Markov model. This flagging would trigger an alert for further investigation by the security team.
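
In a real deployment you would score sequences continuously rather than one at a time. One possible way to operationalize the check, sketched below, is to slide a window over an incoming audit-log DataFrame and flag any window whose probability falls below the threshold; the window length of 3 matches the historical sub-sequences used to set the threshold, and the helper name is my own.

def flag_anomalous_windows(log_df, transition_probs, threshold, window=3):
    # Slide a fixed-length window over the log and flag low-probability windows
    states = list(zip(log_df["action"], log_df["table"]))
    flagged = []
    for i in range(len(states) - window + 1):
        window_states = states[i:i + window]
        prob = calculate_sequence_probability(window_states, transition_probs)
        if prob < threshold:
            flagged.append((log_df["timestamp"].iloc[i], window_states, prob))
    return flagged

# Flag suspicious windows in the simulated new data
flag_anomalous_windows(new_df, transition_probs, threshold)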

Expanding the Model: Handling Complex Scenarios

While the basic Markov model is a powerful tool, real-world scenarios often involve complexities that require further refinement. Here are a few ways to enhance the model:

  1. Handling Seasonality: In some cases, the behavior of service accounts may vary depending on the time of day or the day of the week. Incorporating temporal factors into the model can help differentiate between normal variations and true anomalies.
  2. Multi-Level Markov Models: For more complex systems, a single-level Markov model might not be sufficient. A hierarchical Markov model can be used, where each level captures different granularity of actions. For example, the first level might capture high-level operations (e.g., data retrieval vs. data modification), while the second level captures specific actions within those categories.
  3. Combining with Other Models: Markov models can be combined with other machine learning models, such as clustering algorithms or neural networks, to create a more robust anomaly detection system. This hybrid approach can help capture anomalies that a Markov model alone might miss.
  4. Incorporating Contextual Information: Beyond just actions and tables, including contextual information like the origin of the request (IP address, user agent), time of the request, and the size of the data being accessed can provide a richer context for the model. This information can be encoded into the states or used as additional features in a more complex model (a small example of enriching the states follows this list).
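
As a concrete illustration of points 1 and 4, the state definition itself can be enriched. The minimal sketch below adds a coarse time-of-day bucket to each state, reusing the df from Step 1; the bucket boundaries are arbitrary and just one possible encoding.

# Extend each state with a coarse time-of-day bucket (one possible encoding)
def time_bucket(ts):
    return "business_hours" if 9 <= ts.hour < 18 else "off_hours"

contextual_states = list(
    zip(df["action"], df["table"], df["timestamp"].map(time_bucket))
)

contextual_states[:3]

The transition matrix would then be built over these richer states exactly as in Step 2, at the cost of needing more historical data to estimate the additional transitions reliably.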

Practical Considerations and Limitations

While Markov models offer a powerful method for detecting anomalies in service account activity, there are several practical considerations to keep in mind:

  • Data Volume: In high-traffic environments, the volume of audit logs can be enormous, making real-time processing challenging. Efficient data handling and processing techniques, such as stream processing frameworks, may be necessary to scale this approach.
  • Adaptation Over Time: Service accounts’ behavior may evolve as applications are updated or workflows change. It’s crucial to periodically retrain the Markov model with new data to ensure it remains accurate and effective.
  • Threshold Tuning: Setting the right threshold is critical. Too high, and you may miss real threats; too low, and you might overwhelm your security team with false positives. Regularly reviewing and adjusting the threshold based on the changing environment is essential.
  • Interpretability: One of the strengths of Markov models is their interpretability. Security analysts can easily understand and investigate why a particular sequence was flagged as anomalous, based on the transition probabilities. However, as models become more complex (e.g., incorporating hierarchical structures or being combined with other models), interpretability can diminish, making it harder to justify decisions to stakeholders.

Conclusion

Using a Markov model for anomaly detection in database service account activity provides a structured and statistical approach to identifying potential security breaches. By modeling normal behavior and setting a threshold for what is considered anomalous, organizations can proactively monitor and respond to suspicious activities, protecting critical data from unauthorized access.

This method, while powerful, is just one part of a comprehensive security strategy. It should be used alongside other security measures, such as access controls, encryption, regular audits, and employee training, to build a resilient defense against threats.

By continuously refining the model and adapting to changes in the environment, organizations can maintain a robust anomaly detection system that evolves with their needs. In an era where data breaches are increasingly sophisticated, having a proactive and adaptive security strategy is not just advisable—it’s essential.

This detailed guide offers a practical approach to using Markov models for anomaly detection in database security. Whether you’re managing sensitive data or overseeing critical applications, integrating such models into your security protocol can help safeguard your operations against unauthorized access. As always, refining these models with real-world data and integrating them into a broader security framework will yield the best results.

If you’re interested in trying out the concepts discussed in this post, you can download the full Jupyter notebook here on GitHub. This notebook includes all the code examples and step-by-step instructions to help you implement a Markov model for anomaly detection in your own environment.