Social Network Analysis 3: Basic Concepts and Methods of Privacy Attacks and Protection in Social Networks + De-Anonymization Techniques + Inference Attack Techniques + k-Anonymity + Clustering-Based Privacy Protection Algorithms

Time: 2024-5-28


Foreword

The course “Social Network Analysis” is taught by Mr. Hongwei Lu, whose teaching style is not only rigorous and responsible but also full of humor and personal insight. This direction particularly appeals to me, and I took the course with great interest.

Welcome to Social Network Analysis 3 in this social networking blog series. This installment draws mainly on Chapter 3 of the course slides, “Social Networking Privacy Protection,” and gives a brief introduction to the basic concepts and methods of privacy attacks and protection in social networks. With this post, I hope to offer a more comprehensive view of privacy protection problems in social networks, the approaches to them, and the latest technical developments in these areas.

Social networks are an important part of the modern Internet era: they give us a platform for communication and sharing, and they have become a valuable resource for big data and sentiment analysis. However, with the growing popularity of social networks and the explosive growth in data volume, protecting user privacy has become an issue that cannot be ignored. This post analyzes the privacy leakage problem in social networks from multiple perspectives, exploring its causes, possible attacks, and countermeasures.
  • Social Networking Privacy Breach
In this digital era, social network privacy leakage has become a global problem. There are various ways to expose user data, including but not limited to direct leakage of personal data, privacy risks of complex behaviors, challenges brought by technological development, and data sales driven by economic interests. Understanding these leakage pathways will help us better take effective preventive measures.
  • Social Networking User Data Privacy
The problem of user data privacy in social networks is multifaceted. We will discuss various forms of social network privacy attacks, such as background knowledge-based attacks, node and inter-node relationship recognition attacks, as well as affiliation attacks and probabilistic attacks. In addition, we will explore the current state of the art in privacy protection research, as well as the application of social network anonymization techniques and the challenges they face.

Social Networking Privacy Breach

Social networks are an important part of modern life, but they can also be a hotbed of personal privacy leakage. This section discusses the main ways privacy leaks from social networks, their potential risks, and the corresponding preventive measures.

Pathways to User Data Exposure

  1. Personal profile pages: Basic information such as a user’s gender, age, and contact details may be displayed directly.
  2. Geographic location information: Users may inadvertently reveal their real-time location through behaviors such as checking in.
  3. Social circle analysis: By analyzing a user’s friend connections, an attacker can infer the communities the user belongs to and further deduce sensitive information.

Privacy Risks of Complex Behavior

  • Cross-platform behavioral correlation: A user’s behavior on different platforms may be correlated and analyzed, e.g., correlating a user’s profile on Weibo with his/her statements on Zhihu to reveal more personal information.

Privacy challenges posed by technological developments

  1. Graph neural networks: This advanced technology can effectively analyze the graph-structured data of social networks, thereby increasing the threat to user privacy.
  2. Multimodal information fusion: By fusing user behavioral data across different platforms, attackers can perform targeted analysis of complex, multifaceted, and heterogeneous data.
  3. Machine learning in data analytics: Machine learning can serve as a tool for attackers to mount privacy attacks on social network data.

Financial interests and data sales

  • Social network service providers may sell user data to third-party organizations for profit. For example, in the Cambridge Analytica incident, the personal data of roughly 87 million Facebook users was harvested through a third-party app and shared without the users’ knowledge.

Countermeasures

  • Increase user privacy awareness: Users should be aware of the possible risks associated with the information they share on social networks.
  • Strengthening of laws and regulations: More effective laws and regulations are needed to protect personal data from misuse.
  • Social platform technology improvements: Service providers should take technical measures to safeguard the security of user data.

Social Networks User Data Privacy

Privacy is the object of research and protection in social network privacy protection. Before settling on a protection scheme, the privacy under study must be defined and the direction and goals of the research determined: different privacy goals lead to different problem definitions and, consequently, different protection approaches. In privacy protection for relational databases, the attributes in a table are categorized into non-sensitive and sensitive attributes; the values of the sensitive attributes are private to the individual, and this is what is referred to as privacy. Similarly, in social network data, information such as the presence of nodes and their attributes may become subject to protection. Attributes in social network data that may be considered private include:
  1. Vertex existence and attributes: The presence of individuals in a social network and their attributes, such as vertex degree, can be considered personal privacy.
  2. Sensitive vertex labels: Certain label attributes of individuals in social networks, such as commuting times and routes to work, are sensitive and considered private.
  3. Link relationships: Edges between vertices in a social network represent relationships between social individuals, and such link relationships may be private and in need of protection.
  4. Sensitive edge labels: The attributes attached to edges between individuals are treated as labels. The values of sensitive edge labels are sometimes targets of protection.
  5. Graph parameters: Measures such as betweenness, closeness, centrality, path length, and reachability, which indicate a node’s relationships with, or position in, the social network, can be considered sensitive information.
  6. Link weights: In social networks, edge weights can indicate the closeness of relationships between individuals, such as intimacy between friends or even communication volume, and can also be considered a form of privacy.
These attributes reflect the multidimensionality and complexity of privacy protection in social network data.
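Several of the graph parameters above are easy for an attacker to compute once a graph is published. A minimal pure-Python sketch (the toy graph and names are hypothetical) shows how two of the structural attributes listed, degree and path length, can be derived directly from an adjacency list:

```python
from collections import deque

# Toy social graph as an adjacency list; the names are hypothetical.
graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave":  {"bob"},
}

def degree(g, v):
    """Vertex degree: one of the structural attributes that can act as a quasi-identifier."""
    return len(g[v])

def shortest_path_length(g, src, dst):
    """BFS shortest-path length; path length and reachability also reveal a node's position."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        v, d = queue.popleft()
        if v == dst:
            return d
        for w in g[v]:
            if w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    return None  # dst is unreachable from src

degrees = {v: degree(graph, v) for v in graph}
```

If the published graph contains exactly one vertex of degree 3, that vertex is re-identifiable from its degree alone; this is the intuition behind the k-degree anonymization model described later in these notes.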

Social Network Privacy Attacks

A social network privacy attack exploits information posted by users on social networks. This section introduces two main types: background knowledge-based attacks and node and inter-node relationship identification attacks.

Attacks based on background knowledge

  • Definition of background knowledge: Information that is insufficient to identify the target user on its own, but that, combined with other information, may be used to infer sensitive information about the user.
  • Course of an attack: The attacker first obtains quasi-identifier attributes or other information about the target individual, then collects relevant data from the social network and models this background knowledge. The combination of the two may identify the target user in the network or reveal sensitive information about them.
  • Importance of background knowledge: Background knowledge, usually in the form of quasi-identifiers, is crucial to carrying out privacy attacks.

Node and Inter-Node Relationship Recognition Attacks

  • Method of attack: The attacker identifies target nodes, or the real social relationships between nodes, in a social network by modeling existing background knowledge.
  • Node attribute types
    • Descriptive attributes: Descriptive information possessed by an individual node, such as personal characteristics.
    • Structural attributes: Attributes that arise from the node’s presence in the social network as a whole, such as features of the graph structure.
  • Attack implementation: Attackers look for matching nodes in the social network based on this attribute information. They may also mount linking attacks across datasets to obtain the target nodes’ sensitive attributes.
  • Privacy protection: When releasing data, avoid directly publishing information that can identify nodes, such as ID numbers and other sensitive data.

Affiliation Attack

Affiliation attacks in social networks use an individual’s social connections in the network, such as friends, alumni, and coworkers, to commit privacy violations.
  • Social connections in social networks: Individuals join specific clubs or group activities by connecting with friends, alumni, colleagues, or others around them who share common characteristics.
  • Case study: On Weibo, for example, users can select topics based on the events and people they follow, or share concert photos and videos with their fan groups.
  • Privacy risk: Interest groups provide convenience but also raise security concerns. Individuals in the same group have a higher probability of sharing similar attributes, which attackers can exploit to infer sensitive information.
  • Means of attack: In an affiliation identification attack, the attacker infers whether a target node belongs to a group or has a particular public attribute.

Probabilistic Attack

Probabilistic attacks are a means of privacy invasion with uncertainty.
  • Attack Definition: In the published dataset, the attacker uses probabilistic methods to identify sensitive attributes and information of individuals rather than accurately identifying specific target nodes.
  • Analysis of possibilities: The existence of nodes and the existence of relationships between nodes in a social network can be inferred by probabilistic attacks.
  • Examples of attacks: Even in social network datasets where certain relationships between nodes are hidden for privacy protection, an attacker can still utilize existing background knowledge to perform probabilistic attacks to infer the existence of a particular relationship between target nodes.

State of the Art in Privacy Protection Research

In this section, the current state of domestic and international research on social network user privacy attack methods and social network privacy protection mechanisms will be summarized from two aspects.

State of the Art of Research on User Privacy Attacks on Social Networks

Attacks on social networks can usually be divided into two categories: identity attacks and attribute attacks.
  • Identity attacks: Aim to confirm the identity of users in social network data.
  • Attribute attacks: Aim to predict users’ sensitive attributes, such as sexual orientation or political leanings, from published social network data.
  • Key technologies: Social network de-anonymization techniques and inference attack techniques are the dominant techniques behind these two kinds of attacks.

Social network de-anonymization techniques

  • Data anonymization operations: Before releasing data containing sensitive information, data owners usually perform anonymization operations, such as deleting identifying information such as usernames and phone numbers, and modifying the topology of the social network.
  • Anonymized graphs: Social network graph data processed in this way is known as an anonymized graph.
  • Behavior of the attacker: The attacker acquires auxiliary graphs containing user identity information, for example by purchase or hacking.

Seed-based de-anonymization

  • technical assumption: The attacker knows the corresponding points in the auxiliary graph for some of the users in the anonymized graph and uses these points as seeds.
  • Examples of technical applications
    • Korula et al. perform de-anonymization using a propagation algorithm.
    • Chiasserini et al. used bootstrap percolation theory and graph partitioning techniques to design de-anonymization algorithms.
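The flavor of these seed-based algorithms can be sketched in a few lines. The following simplified propagation is an illustration in the spirit of such methods, not a faithful reimplementation of Korula's or Chiasserini's algorithms: starting from known seed pairs, it repeatedly matches the unmapped neighbors of already-mapped pairs whenever the match is unambiguous by degree. The toy graphs are hypothetical.

```python
from collections import deque

def propagate(anon, aux, seeds):
    """Simplified seed propagation: grow the seed mapping by matching unmapped
    neighbors of mapped vertex pairs, accepting only unique-degree matches."""
    mapping = dict(seeds)
    frontier = deque(seeds)
    while frontier:
        a, b = frontier.popleft()
        # Unmapped neighbors on each side, annotated with their degrees.
        cand_a = {v: len(anon[v]) for v in anon[a] if v not in mapping}
        cand_b = {v: len(aux[v]) for v in aux[b] if v not in mapping.values()}
        for v, dv in cand_a.items():
            matches = [w for w, dw in cand_b.items() if dw == dv]
            if len(matches) == 1:            # only accept unambiguous matches
                mapping[v] = matches[0]
                del cand_b[matches[0]]
                frontier.append((v, matches[0]))
    return mapping

# Anonymized graph (integer IDs) and an isomorphic auxiliary graph (user names).
anon_graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
aux_graph = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}

mapping = propagate(anon_graph, aux_graph, [(1, "a")])
```

Starting from the single seed pair (1, "a"), the propagation recovers the full correspondence of this toy graph; real algorithms add similarity scoring and error correction to survive noise between the two graphs.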

Non-seed based de-anonymization

  • technical assumption: The attacker does not know the seed pairs directly, but obtains them indirectly using information other than topological information.
  • Examples of technical applications
    • Backstrom et al. assume that an attacker can generate fake accounts to connect to nodes in an anonymized graph, and then identify the subgraphs to which these fake accounts are connected.
    • The heuristic propagation algorithm proposed by Narayanan achieves full graph de-anonymization based on known seed pairs.

Development of de-anonymization techniques for social networks

  • Seedless de-anonymization: New techniques have been proposed by researchers that rely only on topological information of the auxiliary graph, such as the Bayesian method-based matching technique of Pedarsani et al.
  • Supporting information in addition to structural information: Methods to enhance the effect of de-anonymization using community information, attribute information, knowledge graphs, etc.

Social Network Inference Attack Techniques

Social network inference attack techniques are based on the rich user information in social networks, where attackers analyze public data to infer private information about users.

Reasoning Attack Cases

  • Political leaning inference: For example, by analyzing a user’s social network friends, an attacker can infer that user’s political leanings.
  • Sexual orientation prediction: Jernigan trained a logistic regression classifier on Facebook user data to predict users’ sexual orientation.
  • Geographic inference: Liu designed a two-stage model that integrates users’ connectivity information and social network behavioral records to predict the city in which they live.
  • Political preference prediction: Lindamood proposed an improved naive Bayes model for predicting users’ political preferences from their attributes and connection information.

Development of Inferential Attack Techniques

  • Applications of rough set theory: Cai et al. proposed a method for sensitive attribute prediction using non-sensitive attributes and buddy relationships based on rough set theory.
  • Attacks on non-social-network users: Garcia’s research shows that even if someone is not a user of a social network, an attacker can use public information from the social networks of their real-life contacts to mount an inference attack.
  • Machine learning in inference attacks: Graph embedding vectors are widely used to represent social network information in network analysis tasks. Ellers et al. proposed an attack model against graph-embedded data that can reason about deleted sensitive accounts and their neighboring nodes.

Current status of research on privacy protection of social network users

In recent years, academia has produced a large body of research on protecting the privacy of social network users. These studies fall mainly into two categories: anonymization techniques that resist identification attacks, and anti-inference techniques that block sensitive attribute attacks.

Social network anonymization techniques

k-degree anonymization

  • Definitions and methodology: Liu et al. define k-degree anonymization as ensuring that any point in a graph has at least k-1 other points with the same degree as it. Their proposed two-stage k-degree anonymization scheme first anonymizes the degree sequence of the graph and then generates graph data with similar structure.
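The degree-sequence stage can be illustrated with a short sketch. This is not Liu et al.'s dynamic-programming algorithm; it is a naive stand-in that groups the sorted degree sequence into runs of k and raises each run to its maximum, which satisfies the k-degree constraint but may lose more information than the optimal scheme (and ignores whether the resulting sequence is realizable as a graph):

```python
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    """A graph is k-degree anonymous if every degree value occurs in at least k vertices."""
    return all(count >= k for count in Counter(degrees).values())

def anonymize_degree_sequence(degrees, k):
    """Naive first stage (illustrative only): sort descending, then raise each
    group of k consecutive degrees to the group's maximum."""
    d = sorted(degrees, reverse=True)
    for i in range(0, len(d), k):
        top = d[i]
        for j in range(i, min(i + k, len(d))):
            d[j] = top
    return d
```

For example, the degree sequence [4, 3, 2, 2, 1, 1] is not 2-degree anonymous (the values 4 and 3 each occur once); the sketch turns it into [4, 4, 2, 2, 1, 1], which is. The second stage of the scheme would then construct a graph realizing the anonymized sequence while staying structurally close to the original.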

k-neighborhood anonymization

  • Concept and construction: Zhou et al. proposed k-neighborhood anonymization for attackers who may know the subgraph between the target node and its neighbors. The method groups users in similar neighborhoods by encoding their single-hop neighborhood nodes and anonymizes each group.

k-isomorphism anonymization

  • Definition and implementation: Cheng et al. proposed the k-isomorphic anonymization method, which achieves anonymization by dividing the graph into k isomorphic subgraphs.

k-anonymization of time-varying graphs

  • Research and Algorithms: The k-anonymization problem for time-varying graphs was explored and algorithms were proposed by Rossi et al.

Anonymization based on graph embedding

  • Improved mechanisms: Zhang et al. address the weaknesses of existing anonymization mechanisms by proposing a graph embedding-based technique to generate false edges that are more difficult to identify.

Social Networking Anti-Inference Attack Techniques

Strategies for stopping inference attacks

  • Attribute and edge adjustments: Cai et al. blocked the attack by removing, or adding noise to, attributes and edges that are highly correlated with private attributes.
  • Data Cleaning Strategy: He et al. find a data cleaning strategy by solving an optimization problem to achieve a balance between data utility and privacy.
  • Adversarial Learning Noise Generation: Jia designed an adversarial learning-based noise generation mechanism for adding noise against inference attacks, but his research did not involve adding noise to edge data.

Common privacy-preserving models of anonymization

This chapter discusses k-anonymity and its derived models, which are important results within the field of privacy protection.

k-anonymity

  • Origin: The k-anonymity model originated from a privacy breach in Massachusetts, U.S.A. in the 1990s. Sweeney successfully re-identified individuals in anonymized healthcare data and showed that 87% of Americans could be identified using just the triple of gender, date of birth, and ZIP code.
  • Model definition: Proposed by Samarati and Sweeney in 2002, k-anonymity requires that each record in the published data be indistinguishable from at least k-1 other records.
  • Effectiveness and limitations: k-anonymized data prevents an attacker from determining personal information with certainty, but increasing k decreases the usability of the data. Machanavajjhala et al. point out that k-anonymity places no constraints on sensitive attributes, which may still lead to privacy breaches.

Improved k-anonymity model

  • l-diversity: To prevent consistency (homogeneity) attacks, l-diversity ensures that the sensitive attribute in any equivalence class takes at least l distinct values.
  • t-closeness: Building on l-diversity, it requires the distribution of the sensitive attribute within every equivalence class to be close to the attribute’s global distribution.
  • (α, k)-anonymity: On top of k-anonymity, it guarantees that within each equivalence class the fraction of records associated with any sensitive attribute value is no higher than α.
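The l-diversity condition is mechanical to check. A minimal sketch (the toy records and column names are hypothetical) groups a published table into equivalence classes by quasi-identifiers and counts distinct sensitive values per class:

```python
from collections import defaultdict

def equivalence_classes(records, quasi_ids):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].append(r)
    return list(groups.values())

def satisfies_l_diversity(records, quasi_ids, sensitive, l):
    """Every equivalence class must contain at least l distinct sensitive values."""
    return all(len({r[sensitive] for r in cls}) >= l
               for cls in equivalence_classes(records, quasi_ids))

# A 2-anonymous toy release that nevertheless fails 2-diversity:
records = [
    {"zip": "472**", "age": "20-29", "disease": "flu"},
    {"zip": "472**", "age": "20-29", "disease": "flu"},     # homogeneous class
    {"zip": "130**", "age": "30-39", "disease": "cancer"},
    {"zip": "130**", "age": "30-39", "disease": "flu"},
]
```

Here both equivalence classes contain 2 records, so 2-anonymity holds, yet the first class carries only the value "flu"; an attacker who can place a victim in that class learns the diagnosis. That is the gap l-diversity closes.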

Challenges of k-anonymity modeling

  • continuous evolution: Despite the improvements made by the above models in improving privacy protection, they still have shortcomings and the traditional privacy-preserving models face constant challenges with the emergence of new attack methods.
  • Limitations of assumptions: These models are based on assumptions about the background knowledge of the attacker and the attack model that do not always hold in reality.
  • differential privacy: It was not until the advent of differential privacy that the fundamental problems of these privacy-preserving models were better addressed.

k-anonymity: examples and attack methods

k-anonymization is an important method of data anonymization. Below are examples of k-anonymization and potential methods of attack.

k-anonymity example

  • Classification of public attributes
    • Identifiers: Information that uniquely identifies an individual, such as name, address, and phone number; these must be removed when the data is published.
    • Quasi-identifiers: Attributes that are not unique but help with data management, e.g., ZIP code, age, and birthday.
    • Sensitive data: The data of interest to researchers, such as purchasing preferences and salaries.
  • The k-anonymity model: Ensure that each combination of quasi-identifier values appears in at least k records, i.e., each equivalence class contains at least k records, thus protecting individual privacy.
  • Methods of implementation
    • Suppression: delete the corresponding data column or replace its values with *.
    • Generalization: make the data indistinguishable, e.g., by converting an exact age into an age range.
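Suppression and generalization together can be shown in a minimal sketch. The table, column names, and generalization widths below are hypothetical; real anonymizers search for the least-general scheme that still satisfies the k constraint, whereas here the widths are fixed and the k constraint is only verified afterwards:

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalize an exact age to a width-year interval, e.g. 23 -> '20-29'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def generalize_zip(zipcode, keep=3):
    """Truncate a postal code, replacing the suppressed digits with '*'."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

def is_k_anonymous(records, quasi_ids, k):
    """Every combination of quasi-identifier values must appear in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical raw table: name is an identifier; age and zip are quasi-identifiers.
raw = [
    {"name": "Alice", "age": 23, "zip": "430074", "salary": 8000},
    {"name": "Bob",   "age": 27, "zip": "430071", "salary": 9500},
    {"name": "Carol", "age": 45, "zip": "130012", "salary": 12000},
    {"name": "Dave",  "age": 41, "zip": "130015", "salary": 7000},
]

published = [
    {"name": "*",                       # suppression: drop the identifier
     "age": generalize_age(r["age"]),   # generalization: exact age -> range
     "zip": generalize_zip(r["zip"]),
     "salary": r["salary"]}             # sensitive attribute kept for analysis
    for r in raw
]
```

The raw table is not 2-anonymous on (age, zip) because every combination is unique; after suppression and generalization, each equivalence class contains two records, so 2-anonymity holds.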

k-anonymity attack methods

  1. Unsorted matching attack
    • The attacker guesses which anonymized record belongs to whom by comparing the order of the published records with the original records.
    • Defense: shuffle the order of the published data.
  2. Homogeneity attack
    • If the sensitive attribute takes the same value throughout a k-anonymous group, the attacker learns it immediately.
  3. Background knowledge attack
    • Even if sensitive values differ within a k-anonymous group, an attacker can use background knowledge to make inferences.
  4. Complementary release attack
    • If the same data is published multiple times under different k-anonymizations, an attacker can infer user information by correlating the releases.
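The homogeneity attack in particular is easy to demonstrate (the toy release below is hypothetical): if the victim's equivalence class carries a single sensitive value, k-anonymity reveals that value outright, which is exactly the gap that l-diversity closes.

```python
def homogeneity_attack(published, quasi_ids, sensitive, known):
    """Return the victim's sensitive value if every record matching the victim's
    quasi-identifiers shares one value; otherwise None (the attack fails)."""
    matches = {r[sensitive] for r in published
               if all(r[q] == known[q] for q in quasi_ids)}
    return matches.pop() if len(matches) == 1 else None

# A 2-anonymous toy release; zip and age are generalized quasi-identifiers.
published = [
    {"zip": "130**", "age": "20-29", "disease": "cancer"},
    {"zip": "130**", "age": "20-29", "disease": "cancer"},  # homogeneous class
    {"zip": "472**", "age": "30-39", "disease": "flu"},
    {"zip": "472**", "age": "30-39", "disease": "cancer"},  # diverse class
]
```

An attacker who knows only that the victim lives in a "130**" ZIP and is in their twenties learns the diagnosis with certainty from the first class, while the diverse second class defeats the same attack.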

Clustering-based privacy-preserving algorithms

Clustering-based privacy-preserving algorithms play a key role in processing social network data. The algorithm’s main idea, workflow, and problem description are given below.

Main Ideas and Workflow

  • Clustering: The nodes of a social network are clustered based on a combined distance between nodes, forming multiple supernodes. The internal details of each supernode are hidden, and the connections between supernodes are reduced to a single edge.
  • Key Steps
    1. Generalization of structure and attribute information: To better defend against background knowledge attacks, inter-node distances are calculated by combining structural and attribute information.
    2. Clustering Algorithm Optimization: Optimizing the selection of initial seed nodes for social network characteristics to improve clustering quality and reduce information loss.
    3. Privacy-preserving strength adaptive algorithm: Use different protection strengths based on the difference in privacy protection strengths required by each node to reduce information loss while improving data security.
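The first step, a node distance combining structure and attributes, can be sketched as a weighted sum of a structural component (Jaccard dissimilarity of neighbor sets) and an attribute component (fraction of differing attribute values). Both components and the weight alpha are illustrative choices, not the specific measure from any particular paper:

```python
def combined_distance(g, attrs, u, v, alpha=0.5):
    """Combined node distance: alpha * structural distance + (1 - alpha) * attribute
    distance, both normalized to [0, 1]."""
    inter = len(g[u] & g[v])
    union = len(g[u] | g[v])
    structural = 1 - inter / union if union else 0.0   # Jaccard dissimilarity
    keys = attrs[u].keys()
    attribute = sum(attrs[u][k] != attrs[v][k] for k in keys) / len(keys)
    return alpha * structural + (1 - alpha) * attribute

# Hypothetical example: two users sharing one neighbor and one attribute value.
g = {"u": {"a", "b"}, "v": {"b", "c"}}
attrs = {"u": {"city": "Wuhan", "age": "20s"},
         "v": {"city": "Wuhan", "age": "30s"}}
```

With one shared neighbor out of three and one differing attribute out of two, the distance is 0.5 · 2/3 + 0.5 · 1/2 ≈ 0.583. Mixing in attribute distance this way is what lets the clustering resist attackers whose background knowledge includes attributes, not just structure.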

Description of the problem

  • Macro data accuracy: Social network analysis needs to ensure that the data is accurate at the macro level, and inaccuracies in localized information should not affect the study of macro characteristics.
  • Limitations of existing algorithms
    1. Subgraph k-based anonymization model: The model assumes that the attacker uses the subgraphs in the social network as the background knowledge for the attack. However, in reality, the background knowledge acquired by the attacker is much richer and more complex.
    2. Privacy protection strength issues: Most algorithms use the same level of privacy protection for all nodes, ignoring the differences in privacy needs of different nodes.

Problem modeling

  • Attribute Generalization Definitions: In anonymized social networks, all attribute values of nodes within a cluster are generalized, replacing specific values with broader values.
  • generalization process
    • Data type distinction: Numeric (e.g., age, income) and non-numeric data require different generalization methods.
    • Numeric Data Generalization: Replace specific values with ranges of values to minimize loss of information.
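Numeric generalization and its cost can be captured in one small function: a cluster's exact values are replaced by their spanning interval, and the information loss is measured as interval width over domain width (an NCP-style metric; the example values and domain are hypothetical):

```python
def generalize_numeric(values, domain):
    """Replace a cluster's exact values with their spanning interval and report
    the normalized information loss (interval width / domain width)."""
    lo, hi = min(values), max(values)
    dom_lo, dom_hi = domain
    loss = (hi - lo) / (dom_hi - dom_lo) if dom_hi > dom_lo else 0.0
    return (lo, hi), loss

# Ages 23, 25, 27 in a cluster, over an assumed age domain of 0-100.
interval, loss = generalize_numeric([23, 27, 25], (0, 100))
```

The cluster's ages generalize to the interval (23, 27) at a loss of 0.04; grouping similar nodes keeps these intervals narrow, which is why clustering quality directly controls information loss.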

Design of clustering-based privacy-preserving algorithms for social networks

Clustering-based privacy-preserving algorithms for social networks are an important privacy-preserving tool, especially when dealing with large-scale social network data. The following are the key aspects and ideas of the algorithm design.

Algorithm Design Overview

  • Based on the k-means clustering algorithm: The algorithm clusters social network nodes with a k-means-style method to generate a number of supernodes.
  • clustering process
    1. initial clustering: Select representative seed nodes and cluster the unassigned nodes based on the nearest cluster until all nodes are assigned.
    2. redistribution: Redistribute the graph after the initial clustering to ensure that all nodes satisfy their privacy preserving needs.
    3. Anonymization: Anonymize all supernodes.
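The three-step flow above can be sketched as follows (a simplified illustration, not the paper's algorithm; it assumes a distance function is given and that at least one cluster reaches the size requirement): every node is assigned to its nearest seed, clusters smaller than the privacy requirement k are dissolved and their members redistributed, and each surviving cluster becomes one supernode.

```python
def cluster_to_supernodes(nodes, dist, seeds, k):
    """Greedy clustering of nodes into supernodes of size >= k.
    dist(u, v) is any node distance, e.g. one combining structure and attributes."""
    clusters = {s: [s] for s in seeds}
    # Step 1: initial clustering -- assign each remaining node to its nearest seed.
    for v in nodes:
        if v in clusters:
            continue
        nearest = min(clusters, key=lambda s: dist(v, s))
        clusters[nearest].append(v)
    # Step 2: redistribution -- dissolve clusters that cannot meet the size-k requirement.
    small = [s for s, members in clusters.items() if len(members) < k]
    for s in small:
        for v in clusters.pop(s):
            nearest = min(clusters, key=lambda t: dist(v, t))
            clusters[nearest].append(v)
    # Step 3: each remaining cluster becomes one anonymized supernode.
    return list(clusters.values())

# Toy example: six nodes on a line, distance = absolute difference, seeds 0 and 5.
supernodes = cluster_to_supernodes(range(6), lambda a, b: abs(a - b), [0, 5], 2)
```

On the toy input the nodes split into two supernodes of three members each; in a published graph, each member's details would be generalized within its supernode and inter-supernode links collapsed to single edges.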

Initial seed node optimization algorithm

  • Characteristics of the k-means algorithm: Sensitive to initial clustering centers and needs to be optimized with social network characteristics.
  • Clustering coefficients and densities: Combining the local clustering coefficient and the density of user nodes, nodes with a high degree of aggregation are selected as the initial centers.
  • Selection method: Candidate nodes are ranked by aggregation density, and the top-ranked or mutually distant nodes are chosen as initial centers to improve the clustering result.
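One simple reading of this idea (an illustrative stand-in for the notes' "aggregation density," not the exact formula) is to score each vertex by local clustering coefficient times degree and take the top-scoring vertices as seeds:

```python
def local_clustering(g, v):
    """Local clustering coefficient: the fraction of pairs of v's neighbors
    that are themselves connected."""
    nbrs = list(g[v])
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for i in range(d) for j in range(i + 1, d)
                if nbrs[j] in g[nbrs[i]])
    return 2 * links / (d * (d - 1))

def pick_seeds(g, m):
    """Rank vertices by clustering coefficient x degree and take the top m."""
    score = {v: local_clustering(g, v) * len(g[v]) for v in g}
    return sorted(g, key=lambda v: score[v], reverse=True)[:m]

# Toy graph: a triangle a-b-c with a pendant vertex d attached to a.
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

In the toy graph, b and c sit entirely inside the triangle (coefficient 1.0) and outscore a, whose neighborhood is diluted by the pendant d; choosing such tightly aggregated vertices as initial centers is what reduces the k-means algorithm's sensitivity to its starting configuration.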

Privacy protection strength adaptive

  • Core and non-core points: The nodes are categorized into core points (requiring high privacy protection) and non-core points (with lower requirements) based on their privacy protection needs.
  • Adjustments to the level of protection: Nodes with high privacy protection increase security, while nodes with low privacy requirements reduce information loss and increase data validity.
  • Core Node Judgment: Determine whether a node is a core node or not based on the density of nodes around the node and the initial clustering.

Outline Questions

1. List a few cases of privacy breaches on social networks.

A few cases of social network privacy breaches include:
  1. Sharing an original photo exposed an address: A member of the public shared an original photo of an outing to Yuyuantan on Weibo, and a close friend was able to discover where it was taken.
  2. Friend-circle information leakage experiment: Strangers added through WeChat’s Shake feature quickly obtained a large amount of personal information from users’ friend circles.
  3. Assault after posted photos exposed an address: In Liaoning, a 23-year-old woman, Zhang Di, was victimized after a suspect exploited her WeChat photo album.
  4. Pupil reflections in celebrity selfies gave away a location: The pupil reflections in Japanese actress Matsuoka Yukinami’s selfies were used by a fan to locate her.
  5. The TV variety show “Super Brain Teenagers” deduced a specific address and flight information from a single aerial photo.
These cases reflect how personal information on social networks can be utilized by others, leading to serious privacy breaches and security risks.

2. Different application scenarios define privacy differently; what attributes of social network data might be called private information in the network?

Attributes in social network data that may be considered private include:
  1. Vertex existence and attributes: The presence of individuals in a social network and their attributes, such as vertex degree, can be considered personal privacy.
  2. Sensitive vertex labels: Certain label attributes of individuals in social networks, such as commuting times and routes to work, are sensitive and considered private.
  3. Link relationships: Edges between vertices in a social network represent relationships between social individuals, and such link relationships may be private and in need of protection.
  4. Sensitive edge labels: The attributes attached to edges between individuals are treated as labels. The values of sensitive edge labels are sometimes targets of protection.
  5. Graph parameters: Measures such as betweenness, closeness, centrality, path length, and reachability, which indicate a node’s relationships with, or position in, the social network, can be considered sensitive information.
  6. Link weights: In social networks, edge weights can indicate the closeness of relationships between individuals, such as intimacy between friends or even communication volume, and can also be considered a form of privacy.
These attributes reflect the multidimensionality and complexity of privacy protection in social network data.

3. What are some of the ways to target privacy attacks on social networks? Describe them briefly.

The main methods of targeting privacy attacks on social networks include:
  1. Attacks based on background knowledge: Attackers combine published social network data and background knowledge to identify targeted individuals and sensitive information.
  2. Node and Inter-Node Relationship Recognition Attacks: Attackers use individual attribute information and structural attribute information in social networks for identification attacks.
  3. Affiliation Attack: An attacker uses social networks to infer whether an individual target node belongs to a group or has a public attribute.
  4. Probabilistic attacks: In published datasets, attackers use methods with a probabilistic character to identify sensitive attributes and information about social individuals.
  5. Social Network De-Anonymization Techniques and Inference Attack Techniques: An attacker obtains an auxiliary graph containing user identity information and social network topology through purchase or hacking, etc., and then performs a de-anonymization attack or an inference attack to infer sensitive attributes of the user.

4. What are social network anonymization techniques? Briefly describe k-anonymization and the attacks against this method.

Social network anonymization techniques are methods that process social network data to protect user privacy before publication. Among them, k-anonymization is a common technique that aims to ensure that any individual’s information is indistinguishable from that of at least k-1 other individuals, thereby preventing individuals from being identified. However, k-anonymity faces a variety of attacks. In attacks based on background knowledge, an attacker uses additional information to distinguish or identify individuals who would otherwise be indistinguishable within the k-anonymity set. Structural attacks are also a common approach against k-anonymity, in which the attacker identifies the target individual by analyzing structural features of the social network.
