공부 정리 블로그

[SMM 9] - Community Analysis(1)- Community detection(감지) 본문

대학원 수업/sns 분석

[SMM 9] - Community Analysis(1)- Community detection(감지)

따옹 2023. 6. 18. 17:20

Why analyze communities?

1. Analyzing communities helps better understand users

– Users form groups based on their interests

2. Groups provide a clear global view of user interactions

3. Some behaviors are only observable in a group setting and not on an individual level

– Some republican can agree with some democrats, but their parties can disagree

 

Social Media Communities

• Formation:

– When like-minded users on social media form a link and start interacting with each other

 

• More Formal Formation:

1. A set of at least two nodes sharing some interest, and

2. Interactions with respect to that interest.

 

• Social Media Communities

– Explicit (emic): formed by user subscriptions (의도한 것)

– Implicit (etic): implicitly formed by social interactions (인지하지 못하지만 발생한 커뮤니티)

우리가 하고자 하는 것, implicit한 것을 어떻게 detection 할까?

• Example: individuals calling Canada from the United States

커뮤니티 파악으로 상품 타겟에 맞는 광고를 해서 매출을 올릴 수 있음

• Phone operator considers them one community for promotional offers

 

• Other community names: group, cluster, cohesive subgroup, or module

 

Examples of Explicit Social Media Communities

Facebook has groups and communities. Users can

– post messages and images,

– can comment on other messages,

– can like posts, and

– can view activities of other users

 

In Google+, Circles represent communities

 

In Twitter, communities form as lists.

– Users join lists to receive information in the form of tweets

 

LinkedIn provides Groups and Associations.

– Users can join professional groups where they can post and share information related to the group

 

Finding Implicit Communities: An Example

A simple graph in which three implicit communities are found, enclosed by the dashed circles

Overlapping vs. Disjoint Communities

Implicit communities in other domains

- Protein-protein interaction networks

- World Wide Web

- the same or related topics

- Metabolic networks

- Food webs

 

What is Community Analysis?

• Community detection(감지)

– Discovering implicit communities

• Community evolution(시간에 따른 변화)

– Studying temporal evolution of communities

• Community evaluation(community를 감지한 정확도를 측정)

– Evaluating Detected Communities

 

Community evaluation

• The process of finding clusters of nodes(‘‘communities’’)

– With Strong internal connections and

Weak connections between different communities

• Ideal decomposition(쪼갬) of a large graph 이상적인 쪼갬 방법

– Completely disjoint communities

– There are no interactions between different communities.

• In practice, 실제로는

– find community partitions that are maximally decoupled.

 

Why Detecting Communities is Important?

예시)

• The club members split into two groups (gray and white)

• Disagreement between the administrator of the club (node 34관리자) and the club’s instructor (node 1사범),

• The members of one group left to start their own club

 

Why Community Detection?

1. 네트워크를 커뮤니티 별로 요약할 수 있어짐(시각화, 이해 쉬워짐)

2. privacy 보호 : 개별 노드 정보를 노출하지 않고도 특성을 나타낼 수 있음

 

Community Detection vs. Clustering

graph clustering 과 유사함

ㅇ Clustering 하나의 vector를 비슷한 유형으로 군집화

graph 구조가 아님(edge 없음, 하나의 행렬에 해당하는 내용이 됨)

• Data is often non-linked (matrix rows)

• Clustering works on the distance or similarity matrix, e.g., k-means.

• If you use k-means with adjacency matrix rows, you are only considering the ego-centric network(원형)

 

ㅇ Community detection

• Data is linked (a graph, egde 있음)

• Network data tends to be “discrete”, leading to algorithms using the graph property directly

– k-clique, quasi-clique, or edge-betweenness

 

Community Detection Algorithms

Group Users based on Group attributes (그룹 속성 기반, nework connection 기반)

Group Users based on Member attributes(멤버 속성 기반 커뮤니티)

 

 

Member-Based Community Detection

• Look at node characteristics; and

• Identify nodes with similar characteristics and consider them a community 유사한 노드 특징들을 커뮤니티로 파악

 

Node Characteristics

A. Degree

– Nodes with same (or similar) degrees are in one community

– Example: cliques

B. Reachability

– Nodes that are close (small shortest paths) are in one community

– Example: k-cliques, k-clubs, and k-clans

C. Similarity

– Similar nodes are in the same community

 

Most common subgraph searched for:

 

• Clique: graph 내 subgraph가 complete 인 것들

더이상 추가할 노드가 없음

참고 : https://analytics4everything.tistory.com/182

cf) complete graph (노드 - 노드 모두 연결)

a maximum complete subgraph in which all nodes inside the subgraph adjacent to each other

Find communities by searching for

우리가 찾고자 하는 것

he maximum clique : v1, v2, v3

1. The maximum clique(주어진 grap clique 중에서 사이즈가 가장 큰 것): the one with the largest number of vertices, or

2. All maximal cliques(사이즈 별 clique) : cliques that are not subgraphs of a larger clique; i.e., cannot be further expanded Both

 

Both problems are NP-hard

 

이것을 극복하기 위해,

 

I. Brute-Force Method

모든 가능한 subgraph를 test 해서 qlique인지 아닌지 찾아냄. 노드 100개를 테스트 하더라도 시간이 오래 걸려서 현실적이지 않음

 

k size나 더 큰 clique를 찾을 때, edge 개수가 k-1은 돼야함

• We can first prune all nodes (and edges connected to them) with degrees less than K − 1

k-1 이하는 자름

– More nodes will have degrees less than k − 1

– Prune them recursively

• For large k, many nodes are pruned as social media networks follow a power-law degree distribution

 

Maximum Clique: Pruning…

 

Example. 

Clique size 4이상인 graph를 찾기 위해서는 degree가 3 이상이 되어야 하므로, 2이하는 pruninig 함 

– STEP1 : Remove nodes 2 and 9

– STEP2 : Remove nodes 1 and 3

– STEP3 : Remove node 4

 

Even with pruning, cliques are less desirable means for finding communities 이론적으로는 유효하나 실효성이 없음

– Cliques are rare 하게 발생

– A clique of 1000 nodes, subgraph는  999x1000/2 edges를 가지고 있다

– A single edge removal destroys the clique 하나만 제거시켜도 1000개의 노드를 가진 graph가 아니게 됨

– That is less than 0.0002% of the edges! 

 

II. Relaxing Cliques

k-plex: a set of vertices & in which we have (qlique를 완화한 개념)

각각의 노드 V에서 degree는 이것 보다 커야 함 v dgree = |V|(전체 node의 dgree 수) -k

 

네트워크 안에 있는 n개의 노드중 적어도 n-k가 fully connected 되어있음을 의미

d_v >=|V| - 1

 

k가 커질 수록 k-plex를 구하기 쉬워짐

• Finding the maximum k-plex: NP-hard

2-plex인 경우 전체 subgraph 개수에서 -2를 뺀 모든 노드의 dgree 5-2 = 3 이상인 graph들의 집합

3-plex : 6-3 = 3이상인 graph들의 집합

전체 graph는 2- plex가 아님 6-2 = 4이상이어야 하므로

1-plex = clique

 

1,2,3,4, 4, 3, plex

maximum clique(complete subgraph) v2, v3, v4, v5

maximal clique : 다른 graph의 subgraph가 아니면서 나올 수 있는 maximum clique, 더 이상 추가로 확장이 불가능 함

(v1, v2, v4) (v2, v4, v6) (v1, v2, v3) (v4, v5, v6) (v2, v3, v4, v6) - > 5개 (정확하지 않음)

 

More Cliques Relaxing…

• k-core: a maximal connected subgraph in which all vertices have degree at least k

– Difference with "-plex?

모두 해당

• k-shell: nodes that are part of the k-core, but are not part of the (k + 1)-core.

하늘색 1-shell/ 빨간색 2-shell / 초록색 3-shell

 

0 core graph, 0 shell  왼쪽 노드

III. Using Cliques as a seed of a Community

Clique Percolation Method (CPM)

clique을 시드로 하여 더 큰 community detect

– Uses cliques as seeds to find larger communities

CPM finds overlapping communities

• Input

– A parameter k, and a network

clique size k 인 모든 graph로 찾고

• Procedure

– Find out all cliques of size k in the given network

– Construct a clique graph.

• Two cliques are adjacent if they share k − 1 nodes

– Each connected components in the clique graph form a community

공통 노드 2개 이하면 edge 생성 x

B. Node Reachability

The two extremes Nodes are assumed to be in the same community

Direct하게 연결돼 있으면 community

cf) 연결되어 있으면 하나의 community로 파악

1. If there is a path between them (regardless of the distance) or

How? Find using BFS/DFS

Challenge: most nodes are in one community (giant component)

하나의 component에 속하면 하나의 community(적철치 않음)

 

2. They are so close as to be immediate neighbors.

How? Finding Cliques

Challenge: Cliques are challenging to find and are rarely observed

 

Solution: find communities that are in between cliques and connected components in terms of connectivity and have small shortest paths between their nodes

 

Special Subgraphs

1. k-Clique: a maximal subgraph in which the largest shortest path distance between any nodes is less than or equal to k

가장 큰 최단거리가 k보다 작은 subgraph를 찾고 싶다. (member 상관 없음)

 

2. k-Club: follows the same definition as a k-clique 과 정의는 동일하지만

v4, v5는 path lenth 3이기 때문에 2-club아님 (member 중요)

 

3. k-Clan: a k-clique where for all shortest paths within the subgraph the distance is equal or less than K. (1, 2의 교집합)

– k-Clan = k-Cliques ⋂  k-Clubs 두 가지 성질을 모두 다 가진 경우

 

 

More Special Subgraphs!

• "k-truss: the largest subgraph where all edges belong to k − 2 triangles

4-truss : subgraph의 모든 egde는 2개의 triangle 에 속해있어야함

어떤 edge가 있을 때 k-2 개의 삼각형에 속해 있어야함

 

• What is the relationship between a "-core and "-truss?

 

C. Node Similarity 노드 유사도 측면

• Similar (or most similar) nodes are assumed to be in the same community.

비슷한 노드는 같은 커뮤니티에 속한다고 봄

– A classical clustering algorithm (e.g., k-means) is applied to node similarities to find communities.

Jaccard, Cosine유사도를 측정한 뒤 k-means

• Node similarity can be defined

– Using the similarity of node neighborhoods (Structural Equivalence) – Ch. 3

– Similarity of social circles (Regular Equivalence) – Ch. 3

Structural equivalence: two nodes are structurally equivalent iff. they are connecting to the same set of actors

Nodes 1 and 3 (이웃이 유사하기 때문에 유사한 node로 볼 수 있음) are structurally equivalent, So are nodes 5 and 6.

 

Node Similarity (Structural Equivalence)

 

Grop-Based 

노드들간의 연결을 봄

Group Properties:

I. Balanced: Spectral clustering -> 커뮤니티에 있는 노드들이 균형을 갖도록 커뮤니티 파악

II. Robust: k-connected graphs

III. Modular: Modularity Maximization

IV. Dense: Quasi-cliques

V. Hierarchical: Hierarchical clustering

 

I. Balanced Communities

graph 분할하겠음

• Community detection can be thought of graph clustering

• Graph clustering: we cut the graph into several partitions and assume these partitions represent communities

• Cut: partitioning (cut) of the graph into two (or more) sets (cutsets) 무작정 쪼개는 것이 아닌 cut을 찾겠다 

cut의 크기는 분할된 파티션을 연결하는 것이 cut size, cut edge connection 이 최소가 되도록 쪼갬

 

• Minimum cut (min-cut) problem: find a graph partition such that the number of edges between the two sets is minimized

 

Min-cuts can be computed efficiently using the max-flow min-cut theorem

Min-cut often returns an imbalanced partition, with one set being a singleton

 

Ratio Cut and Normalized Cut 

불균형이 되지 않게 cut

• To mitigate the min-cut problem we can change the objective function to consider community size

k는 내가 쪼개려는 graph의 개수

For Cut A

vol(P_i) = 전체 edge =13 x 2 + 1, 모든 노드의 dgree의 총합

For Cut B

Both ratio cut and normalized cut prefer a balanced partition

 

Spectral Clustering (cut 기반 clustering)

Reformulating ratio cut (or normalized cut) in matrix format

matrix 형태로 cut을 도출

1. X-matrix : k community의 개수

행 : 사용자 1~n까지

column : K개의 comnunity

n x k, node1이 community(k)에 속하면 1 아니면 0

column은 속하느냐 1, 아니냐 0 -> membership

 A- Matrix : 인접행렬

• D-Matrix : Let D = diag(d_1, d_2, ⋯, d_n) be the diagonal degree matrix

사용자 n 명,, 각 node의 dgree가 대각선에 위치, 나머지 0

m x n 행렬

• X^T A X (k x k matrix) / A : graph의 인접행렬, i 번 째 community 속한 edge의 개수 = 6

i  커뮤니티 안에만 존재하는 edge

• X^T D X :  comnunity i 에 연결된  모든 edge의 개수(밖에 해당하는 edge 개수) = 12

• X^T (D − A) X  : community i 와 cut에 해당하는 edge가 digonal에 포함 ( k x k matrix)

 

The i th diagonal element of X^T (D − A) X is equivalent to cut(P_i, P(bar)_i)

Both ratio/normalized cut can be reformulated as

 

ex)

L Matrix :  unnormalized matrix를 가지고 eigenvector, value 구함

9 x 9 Matrix -> 9 value, vector 나옴

lambda 1 는 동일

lambda 2 음수에 해당하는 것, 양수에 해당하는 것 / 2개의 partition으로 분할 [1, 1, 1, 0, 0, 0, 0, 0, 0] -> spectral clustering

k차원에 해당하는 clustering 가능

 

II. Robust Communities

• The goal is find subgraphs robust enough such that removing some edges or vertices does not disconnect the subgraph

• k-vertex connected (k-connected) graph: 적어도 k개의 노드를 제거해야 2개로 쪼개짐

• k-edge connected: 적어도 k개의 edge를 제거해야 함

적어도 2개가 제거 되어야 분리 가능, 4-connected graph

• Examples:

– Complete graph of size ': unique ' n-connected graph

– A cycle: 2-connected graph

 

III. Modular Communities

Consider a graph G(V, E), degree는 알지만, edge 연결은 모를 때,

– Consider two vertices v_i and v_j with degrees d_i and d_j.

– |E|전체 edge의 개수 = m

둘 사이 평균 몇개의 edge로 연결되어 있나

 

Modularity and Modularity Maximization

• Given a degree distribution, we know the expected number of edges between any pairs of vertices

• We assume that real-world networks should be far from random. Therefore, the more distant they are from this randomly generated network, the more structural they are.

random과 social graph를 비교했을 때, 차이가 클수록 좋음. 거리의 차이를 정의

• Modularity defines this distance and modularity maximization tries to maximize this distance

Modularity Maximization : 최대화 하는 방향으로 커뮤니티 ㅠㅏ악

 

Normalized Modularity

P= (P1, P2, P3,…, Pk)

p1에 속하는 노드의 집합이 무엇인지

(edge i, j가 community에 direct 연결된 값) - (node i, j 가 랜덤하게 연결했을 때 나올 수 있는 평균 기대치에 해당하는 edge경우)

차이가 커질 수록 커뮤니티 분할이 잘 되었음



최대가 되는 값을 구하는 게 Normalized Modularity

 

Modularity Maximization

 

X ∈ ℝ^(n x k)- is the indicator (partition membership) function:

X_ij = 1 iff. v_i ∈ P_j

• Similar to Spectral clustering,

• We relax + to be orthogonal, i.e., matrix Xhat

• The optimal solution for Xhat is the top k eigenvectors of B.

• To recover the original X, we can run k-means on Xhat

• Difference: in Modularity, the top ! eigenvalues have to be positive!

 

example)

L = D - A

 

d = n x n matrix 의 대각 성분(전체 degree의 총 합)/ 마지막 graph에 대한 TOP 2 eigenvector, value 구함
m : 전체 degree의 총 합

IV. Dense Communities: g-dense

graph density 정의 clique와 얼마나 유사한가를 척도로 삼음/1에 가까운 complete, 0에 가까울 수록 sparse

• The density of a graph defines how close a graph is to a clique

• A subgraph G(V, E) is a gamma-dense (or quasi-clique, ) if 0.8을 주면 complete graph 와 비교했을 때, edge 개수가 0.8이상인 것으로 정의

• A 1-dense graph is a clique

밀도가 높은 graph를 찾고 싶음(community로 파악)

 

1. Local search

• Sample a subgraph, and find a maximal g-dense quasiclique (greedy 방식을 취함)

gamma dense 한 밀도가 높은 subgraph를 찾은 다음 확장해나 감

초기값으로 설정

degree가 높은 부분에서 확장해 나가면서 gamma를 만족하지 못하면 그만

 

2. Heuristic pruning

확장된 그래프에서 degree가 낮은 노드를 제거해 나감 -> shrinking -> gamma를 만족할 때까지 찾아나감

 

V. Hierarchical Communities

cf) single level community 파악

 

– Communities may have hierarchies(어떤 커뮤니티는 계층을 가질 수 있음)

• Each community can have sub/super communities.

– Hierarchical clustering deals with this scenario and generates community hierarchies.

계층구조를 가지는 커뮤니티를 가정하고 커뮤니티 생성시킴

 

Hierarchical Community Detection

• Goal: build a hierarchical structure of communities based on network topology

• Allow the analysis of a network at different resolutions

graph 에서 n개의 node가 있을 때, 하나의 커뮤로 보거나 n개의 커뮤로 볼 수 있음 (2가지 방식)

– split (divisive hierarchical clustering) : 전체 커뮤에서 계속해서 분할해나가는 방식(top-down)

– merged (agglomerative hierarchical clustering): 각각의 커뮤를 합해나가는 방식 (bottom up)

 

Divisive Hierarchical Clustering

• Divisive clustering

– Partition nodes into several sets

– Each set is further divided into smaller ones

Network-centric partition can be applied for the partition

커다란 것을 top down으로 쪼개 감

 

• One particular example:

Girvan-Newman Algorithm: recursively remove the “weakest” links within a “community”

쪼갤 때 weak link에 대응되는 edge를 제거해나가면서 graph를 계층화시켜서 분할

 

weakest link ? 

 

Edge betweenness : edge를 통과하는 short path의 lenth 개수를 measure

값이 높으면 둘 사이의 bridgeness(해당 노드를 제거시키면 graph가 분할됨) bridge에 해당하는 edge

high betweenness : bridge between two communities

(1-2) edge가 포함된 short pa서는 2에서 {4, 5, 6, 7, 8, 9} (6개의 short path) 는 두가지 path가 있음 e(1, 2) or e(2, 3), and e(1,2) is the shortest path between 1 and 2

e(1, 2) 에서 가는 평균적인 short path는 6/2 =3개가 평균

6/2+ 1(1-2 short path 추가) = 4, betweeness = 4

 

가장 높은 betweeness e(4, 5) or e(4, 6)를 제거

이렇게 쪼개지고 위의 과정을 반복해가면서 graph 분할 -> 계층 안에서 graph community 파악됨