SocialED.detector

LDA

class SocialED.detector.lda.LDA(dataset, num_topics=50, passes=20, iterations=50, alpha='symmetric', eta=None, random_state=1, eval_every=10, chunksize=2000, file_path='../model/model_saved/LDA/')[source]

Bases: object

The LDA model for social event detection that uses Latent Dirichlet Allocation for topic modeling and event detection.

Note

This detector uses topic modeling to identify events in social media data. The model requires a dataset object with a load_data() method.

See [1] for details.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • num_topics (int, optional) – Number of topics to extract. Default: 50.

  • passes (int, optional) – Number of passes through corpus during training. Default: 20.

  • iterations (int, optional) – Maximum number of iterations through corpus. Default: 50.

  • alpha (str or float, optional) – Prior document-topic distribution. Default: 'symmetric'.

  • eta (float, optional) – Prior topic-word distribution. Default: None.

  • random_state (int, optional) – Random seed for reproducibility. Default: 1.

  • eval_every (int, optional) – Log perplexity evaluation frequency. Default: 10.

  • chunksize (int, optional) – Number of documents per training chunk. Default: 2000.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/LDA/'.
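
The only stated requirement on dataset is a load_data() method. As a rough illustration of that contract, the sketch below defines a structural type and a toy in-memory dataset. The names SupportsLoadData and ToyDataset, and the "text" and "event_id" fields, are hypothetical; the actual format returned by load_data() is not specified here.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsLoadData(Protocol):
    """Structural type for the `dataset` argument (hypothetical name)."""

    def load_data(self) -> Any:
        ...


class ToyDataset:
    """Hypothetical in-memory dataset used only for illustration."""

    def __init__(self, records):
        self._records = list(records)

    def load_data(self):
        # Return the raw records unchanged.
        return self._records


# Field names below are assumptions, not the library's schema.
dataset = ToyDataset([{"text": "earthquake hits city", "event_id": 0}])
print(isinstance(dataset, SupportsLoadData))
```

Any object that exposes load_data() satisfies this check, which matches the duck-typed wording of the parameter description.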

preprocess()[source]

Data preprocessing: tokenization, stop words removal, etc.

create_corpus(df, text_column)[source]

Create corpus and dictionary required for LDA model.

load_model()[source]

Load the LDA model from a file.

display_topics(num_words=10)[source]

Display topics generated by the LDA model.

fit()[source]
detection()[source]

Assign topics to each document and save unique ground truths and predictions to a CSV file.
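
One common way to assign a topic to a document is to take the topic with the highest probability in its document-topic distribution. The sketch below assumes each document is represented as a list of (topic_id, probability) pairs; the helper name dominant_topic and the -1 fallback for empty distributions are assumptions, not necessarily what detection() does.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability, or -1 if empty."""
    if not doc_topics:
        return -1
    return max(doc_topics, key=lambda pair: pair[1])[0]


# Toy document-topic distributions (made-up numbers).
docs = [
    [(0, 0.1), (3, 0.7), (5, 0.2)],
    [(2, 0.55), (4, 0.45)],
    [],
]
predictions = [dominant_topic(d) for d in docs]
print(predictions)  # [3, 2, -1]
```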

evaluate(ground_truths, predictions)[source]

Evaluate the model.

BiLSTM

class SocialED.detector.bilstm.BiLSTM(dataset, lr=0.001, batch_size=1000, dropout_keep_prob=0.8, embedding_size=300, max_size=5000, seed=1, num_hidden_nodes=32, hidden_dim2=64, num_layers=1, bi_directional=True, pad_index=0, num_epochs=20, margin=3, max_len=10, file_path='../model/model_saved/Bilstm/')[source]

Bases: object

The BiLSTM model for social event detection that uses bidirectional LSTM to detect events in social media data.

Note

This detector uses bidirectional LSTM to identify events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • lr (float, optional) – Learning rate for optimizer. Default: 1e-3.

  • batch_size (int, optional) – Batch size for training. Default: 1000.

  • dropout_keep_prob (float, optional) – Dropout keep probability. Default: 0.8.

  • embedding_size (int, optional) – Size of word embeddings. Default: 300.

  • max_size (int, optional) – Maximum vocabulary size. Default: 5000.

  • seed (int, optional) – Random seed for reproducibility. Default: 1.

  • num_hidden_nodes (int, optional) – Number of LSTM hidden nodes. Default: 32.

  • hidden_dim2 (int, optional) – Size of second hidden layer. Default: 64.

  • num_layers (int, optional) – Number of LSTM layers. Default: 1.

  • bi_directional (bool, optional) – Whether to use bidirectional LSTM. Default: True.

  • pad_index (int, optional) – Index used for padding. Default: 0.

  • num_epochs (int, optional) – Number of training epochs. Default: 20.

  • margin (int, optional) – Margin for triplet loss. Default: 3.

  • max_len (int, optional) – Maximum sequence length. Default: 10.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/Bilstm/'.

preprocess()[source]

Data preprocessing: tokenization, stop words removal, etc.

split()[source]

Split the dataset into training, validation, and test sets.

load_embeddings()[source]

Load pre-trained word embeddings.

train(model, train_iterator, optimizer, loss_func, log_interval=40)[source]

Train the BiLSTM model.

evaluate(ground_truths, predictions)[source]

Evaluate the model.

run_train(epochs, model, train_iterator, test_iterator, optimizer, loss_func)[source]

Run the training and evaluation process for the BiLSTM model.

fit()[source]

Fit the model on the training data and save the best model.

detection()[source]

Detect events using the best trained model on the test data.

class SocialED.detector.bilstm.LSTM(*args: Any, **kwargs: Any)[source]

Bases: Module

init_hidden(batch_size)[source]
forward(text, text_lengths)[source]
class SocialED.detector.bilstm.VectorizeData(*args: Any, **kwargs: Any)[source]

Bases: Dataset

pad_data(tweet)[source]
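
The max_len and pad_index parameters of BiLSTM suggest that each tweet is truncated or padded to a fixed length. A minimal sketch of that idea follows; it assumes the tweet is already a list of token ids and that padding is appended on the right, neither of which is confirmed here.

```python
def pad_data(token_ids, max_len=10, pad_index=0):
    """Truncate to max_len, then right-pad with pad_index (assumed behavior)."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_index] * (max_len - len(clipped))


print(pad_data([4, 8, 15], max_len=5))            # [4, 8, 15, 0, 0]
print(pad_data([1, 2, 3, 4, 5, 6, 7], max_len=5))  # [1, 2, 3, 4, 5]
```
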
class SocialED.detector.bilstm.OnlineTripletLoss(*args: Any, **kwargs: Any)[source]

Bases: Module

Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated by a triplet_selector object that takes embeddings and targets and returns the indices of triplets.

forward(embeddings, target)[source]
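
For orientation, the standard triplet loss over selected (anchor, positive, negative) index triplets can be sketched as below. This is a plain-Python illustration, not the module's implementation: the use of squared Euclidean distance and of a mean over triplets are assumptions, and the helper names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(embeddings, triplets, margin=3.0):
    """Mean of max(0, d(a, p) - d(a, n) + margin) over all triplets."""
    if not triplets:
        return 0.0
    losses = []
    for a, p, n in triplets:
        d_ap = squared_distance(embeddings[a], embeddings[p])
        d_an = squared_distance(embeddings[a], embeddings[n])
        losses.append(max(0.0, d_ap - d_an + margin))
    return sum(losses) / len(losses)


embeddings = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
easy = triplet_loss(embeddings, [(0, 1, 2)])  # negative is far away
hard = triplet_loss(embeddings, [(0, 2, 1)])  # negative is closer than positive
print(easy, hard)  # 0.0 11.0
```
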
SocialED.detector.bilstm.pdist(vectors)[source]
class SocialED.detector.bilstm.TripletSelector[source]

Bases: object

Implementations should return the indices of anchor, positive, and negative samples as an np array of shape [N_triplets x 3].

get_triplets(embeddings, labels)[source]
class SocialED.detector.bilstm.FunctionNegativeTripletSelector(margin, negative_selection_fn, cpu=True)[source]

Bases: TripletSelector

For each positive pair, takes the hardest negative sample (with the greatest triplet loss value) to create a triplet. Margin should match the margin used in triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples, and return a negative index for that pair.

get_triplets(embeddings, labels)[source]
SocialED.detector.bilstm.random_hard_negative(loss_values)[source]
SocialED.detector.bilstm.hardest_negative(loss_values)[source]
SocialED.detector.bilstm.HardestNegativeTripletSelector(margin, cpu=False)[source]
SocialED.detector.bilstm.RandomNegativeTripletSelector(margin, cpu=False)[source]
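
The negative_selection_fn contract described above (take the loss values of all negatives for one anchor-positive pair, return one negative index) can be illustrated with two standalone functions. These are re-implementations written for this example under the assumption that only negatives with a positive loss count as "hard" and that None signals no usable negative; they are not copied from the module.

```python
import random


def hardest_negative(loss_values):
    """Index of the negative with the largest loss, or None if none is positive."""
    if not loss_values:
        return None
    best = max(range(len(loss_values)), key=lambda k: loss_values[k])
    return best if loss_values[best] > 0 else None


def random_hard_negative(loss_values, rng=random):
    """Index of a randomly chosen negative with positive loss, or None."""
    hard = [k for k, value in enumerate(loss_values) if value > 0]
    return rng.choice(hard) if hard else None


losses = [-1.0, 2.5, 0.3]
print(hardest_negative(losses))                       # 1
print(random_hard_negative(losses, random.Random(0)))  # 1 or 2
print(hardest_negative([-1.0, -0.2]))                 # None
```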

Word2Vec

class SocialED.detector.word2vec.WORD2VEC(dataset, vector_size=100, window=5, min_count=1, sg=1, file_path='../model/model_saved/Word2vec/word2vec_model.model')[source]

Bases: object

The Word2Vec model for social event detection that uses word embeddings to detect events in social media data.

Note

This detector uses word embeddings to identify semantic relationships and detect events in social media data. The model requires a dataset object with a load_data() method.

See [2] for details.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • vector_size (int, optional) – Dimensionality of word vectors. Default: 100.

  • window (int, optional) – Maximum distance between current and predicted word. Default: 5.

  • min_count (int, optional) – Minimum word frequency. Default: 1.

  • sg (int, optional) – Training algorithm: Skip-gram (1) or CBOW (0). Default: 1.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/Word2vec/word2vec_model.model'.

preprocess()[source]

Data preprocessing: tokenization, stop words removal, etc.

fit()[source]

Train the Word2Vec model and save it to a file.

load_model()[source]

Load the Word2Vec model from a file.

document_vector(document)[source]

Create a document vector by averaging the Word2Vec embeddings of its words.
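
Averaging word embeddings can be sketched in a few lines. The version below uses a plain dict in place of a trained model; skipping out-of-vocabulary words and returning a zero vector for documents with no known words are assumptions, and the signature differs from the method above for the sake of a self-contained example.

```python
def document_vector(tokens, word_vectors, vector_size):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * vector_size
    # zip(*known) iterates over the vectors dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]


# Toy 2-dimensional embeddings (made-up values).
word_vectors = {"fire": [1.0, 3.0], "alarm": [3.0, 5.0]}
print(document_vector(["fire", "alarm", "unknown"], word_vectors, 2))  # [2.0, 4.0]
print(document_vector(["unknown"], word_vectors, 2))                   # [0.0, 0.0]
```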

detection()[source]

Detect events by representing each document as the average Word2Vec embedding of its words.

evaluate(ground_truths, predictions)[source]

Evaluate the model.

GloVe

class SocialED.detector.glove.GloVe(dataset, num_clusters=50, random_state=1, file_path='../model/model_saved/GloVe/', model='../model/model_needed/glove.6B.100d.txt')[source]

Bases: object

The GloVe model for social event detection that uses GloVe word embeddings to detect events in social media data.

Note

This detector uses word embeddings to identify events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • num_clusters (int, optional) – Number of clusters for KMeans clustering. Default: 50.

  • random_state (int, optional) – Random seed for reproducibility. Default: 1.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/GloVe/'.

  • model (str, optional) – Path to pre-trained GloVe word vectors file. Default: '../model/model_needed/glove.6B.100d.txt'.

load_glove_vectors()[source]

Load GloVe pre-trained word vectors.
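
Pre-trained GloVe text files such as glove.6B.100d.txt store one word per line followed by its space-separated vector components. A minimal parser for that format is sketched below; it reads from any text handle and takes a handle argument so the example is self-contained, which differs from the argument-free method above.

```python
import io


def load_glove_vectors(handle):
    """Parse GloVe text format: '<word> <v1> <v2> ...' on each line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Two made-up 3-dimensional entries standing in for a real file.
sample = io.StringIO("storm 0.1 -0.2 0.3\nflood 0.4 0.5 -0.6\n")
vectors = load_glove_vectors(sample)
print(vectors["storm"])  # [0.1, -0.2, 0.3]
```

With a real file, the same function could be called as load_glove_vectors(open(path, encoding="utf-8")).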

preprocess()[source]

Data preprocessing: tokenization, stop words removal, etc.

text_to_glove_vector(text, embedding_dim=100)[source]

Convert text to GloVe vector representation.

create_vectors(df, text_column)[source]

Create GloVe vectors for each document.

load_model()[source]

Load the KMeans model from a file.

fit()[source]
detection()[source]

Assign clusters to each document.

evaluate(ground_truths, predictions)[source]

Evaluate the model.

WMD

class SocialED.detector.wmd.WMD(dataset, vector_size=100, window=5, min_count=1, sg=1, num_best=5, threshold=0.6, batch_size=1000, n_workers=None, file_path='../model/model_saved/WMD/')[source]

Bases: object

The WMD model for social event detection that uses Word Mover’s Distance to measure document similarity and detect events.

Note

This detector uses word embeddings and Word Mover’s Distance to identify similar documents and detect events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • vector_size (int, optional) – Dimensionality of word vectors. Default: 100.

  • window (int, optional) – Maximum distance between current and predicted word. Default: 5.

  • min_count (int, optional) – Minimum word frequency. Default: 1.

  • sg (int, optional) – Training algorithm: Skip-gram (1) or CBOW (0). Default: 1.

  • num_best (int, optional) – Number of best matches to return. Default: 5.

  • threshold (float, optional) – Similarity threshold for event detection. Default: 0.6.

  • batch_size (int, optional) – Batch size for processing. Default: 1000.

  • n_workers (int, optional) – Number of worker processes. Default: None, which resolves to CPU count - 1.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/WMD/'.
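
To make the interaction of num_best and threshold concrete, the sketch below filters a list of similarity scores and keeps the highest-scoring matches. It assumes scores where higher means more similar and an inclusive threshold; the helper name best_matches is hypothetical and the detector's actual matching logic may differ.

```python
import heapq


def best_matches(similarities, num_best=5, threshold=0.6):
    """Return up to num_best (index, score) pairs with score >= threshold, best first."""
    candidates = [(i, s) for i, s in enumerate(similarities) if s >= threshold]
    return heapq.nlargest(num_best, candidates, key=lambda pair: pair[1])


# Similarity of one test document to four training documents (made-up values).
scores = [0.9, 0.2, 0.65, 0.7]
print(best_matches(scores, num_best=2))  # [(0, 0.9), (3, 0.7)]
```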

preprocess()[source]

Optimized data preprocessing.

fit()[source]

Train the Word2Vec model and save it to a file.

detection()[source]

Optimized event detection.

_save_results(ground_truths, predictions)[source]

Helper method for saving results.

evaluate(ground_truths, predictions)[source]

Evaluate the model and save results.

SocialED.detector.wmd.process_document(doc, instance, train_df, threshold, num_best)[source]

Bert

class SocialED.detector.bert.BERT(dataset, model_name='../model/model_needed/bert-base-uncased', max_length=128, df=None, train_df=None, test_df=None)[source]

Bases: object

The BERT model for social event detection that uses BERT embeddings to detect events in social media data.

Note

This detector uses BERT embeddings to identify events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • model_name (str, optional) – Path to a pretrained BERT model, or a model name from HuggingFace. If the path doesn't exist, falls back to 'bert-base-uncased'. Default: '../model/model_needed/bert-base-uncased'.

  • max_length (int, optional) – Maximum sequence length for BERT tokenizer. Longer sequences will be truncated. Default: 128.

  • df (pandas.DataFrame, optional) – Preprocessed dataframe. If None, will be created during preprocessing. Default: None.

  • train_df (pandas.DataFrame, optional) – Training data split. If None, will be created during model fitting. Default: None.

  • test_df (pandas.DataFrame, optional) – Test data split. If None, will be created during model fitting. Default: None.

preprocess()[source]

Data preprocessing: tokenization, stop words removal, etc.

get_bert_embeddings(text)[source]

Get BERT embeddings for a given text.

fit()[source]
detection()[source]

Detect events by comparing BERT embeddings.

evaluate(ground_truths, predictions)[source]

Evaluate the BERT-based model.

SBert

class SocialED.detector.sbert.SBERT(dataset, model_name='../model/model_needed/paraphrase-MiniLM-L6-v2', df=None, train_df=None, test_df=None)[source]

Bases: object

The SBERT model for social event detection that uses Sentence-BERT for text embedding and event detection.

Note

This detector uses Sentence-BERT to generate text embeddings for identifying events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • model_name (str, optional) – Path or name of the SBERT model to use. Default: '../model/model_needed/paraphrase-MiniLM-L6-v2'.

  • df (pandas.DataFrame, optional) – Processed dataframe. Default: None.

  • train_df (pandas.DataFrame, optional) – Training dataframe. Default: None.

  • test_df (pandas.DataFrame, optional) – Test dataframe. Default: None.

preprocess()[source]

Data preprocessing: tokenization, stop words removal, etc.

get_sbert_embeddings(text)[source]

Get SBERT embeddings for a given text.

fit()[source]
detection()[source]

Detect events by comparing SBERT embeddings.

evaluate(ground_truths, predictions)[source]

Evaluate the model.

EventX

class SocialED.detector.eventx.EventX(dataset, file_path='../model/model_saved/eventX/', num_repeats=5, min_cooccur_time=2, min_prob=0.15, max_kw_num=3)[source]

Bases: object

The EventX model for social event detection that extracts events from breaking news using keyword co-occurrence and graph-based clustering.

Note

This detector uses keyword co-occurrence and graph-based clustering to identify events in social media data. The model requires a dataset object with a load_data() method.

See [3] for details.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/eventX/'.

  • num_repeats (int, optional) – Number of times to repeat keyword extraction. Default: 5.

  • min_cooccur_time (int, optional) – Minimum number of times keywords must co-occur. Default: 2.

  • min_prob (float, optional) – Minimum probability threshold for keyword selection. Default: 0.15.

  • max_kw_num (int, optional) – Maximum number of keywords to extract per document. Default: 3.
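
The roles of min_cooccur_time and min_prob can be illustrated with a small keyword co-occurrence graph. In this sketch an edge is kept when two keywords appear together in at least min_cooccur_time documents and the conditional co-occurrence probability is at least min_prob in both directions. That definition of the probability, and the helper name build_keyword_edges, are assumptions made for the example rather than the module's confirmed behavior.

```python
from collections import Counter
from itertools import combinations


def build_keyword_edges(docs_keywords, min_cooccur_time=2, min_prob=0.15):
    """Return sorted (kw_a, kw_b, count) edges that pass both thresholds."""
    kw_count = Counter()
    pair_count = Counter()
    for keywords in docs_keywords:
        unique = sorted(set(keywords))
        kw_count.update(unique)
        pair_count.update(combinations(unique, 2))

    edges = []
    for (a, b), n in pair_count.items():
        if n < min_cooccur_time:
            continue
        # Assumed definition: P(b | a) = n / count(a), and likewise P(a | b).
        if n / kw_count[a] >= min_prob and n / kw_count[b] >= min_prob:
            edges.append((a, b, n))
    return sorted(edges)


docs = [
    ["quake", "tokyo"],
    ["quake", "tokyo", "rescue"],
    ["quake", "rescue"],
    ["election"],
]
edges = build_keyword_edges(docs)
print(edges)  # [('quake', 'rescue', 2), ('quake', 'tokyo', 2)]
```

Communities found in such a graph would then stand for candidate events, as the class description indicates.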

preprocess()[source]
split()[source]

Split the dataset into training, validation, and test sets.

fit()[source]
detection()[source]
evaluate(ground_truths, predictions)[source]

Evaluate the model.

construct_dict(df, dir_path=None)[source]
map_dicts(kw_pair_dict, kw_dict, dir_path=None)[source]
construct_kw_graph(kw_pair_dict, kw_dict, min_cooccur_time, min_prob)[source]
detect_kw_communities_iter(G, communities, kw_pair_dict, kw_dict, max_kw_num=3)[source]
map_communities(communities, map_kw_to_index)[source]
classify_docs(test_tweets, m_communities, map_kw_to_index, dir_path=None)[source]
map_tweets(df, dir_path=None)[source]
SocialED.detector.eventx.detect_kw_communities(G, communities, kw_pair_dict, kw_dict, max_kw_num=3)[source]

CLKD

class SocialED.detector.clkd.CLKD(dataset, n_epochs=1, n_infer_epochs=0, window_size=3, patience=5, margin=3.0, lr=0.001, batch_size=2000, n_neighbors=800, word_embedding_dim=300, hidden_dim=8, out_dim=32, num_heads=4, use_residual=True, validation_percent=0.1, test_percent=0.2, use_hardest_neg=False, metrics='ami', use_cuda=False, gpuid=0, mask_path=None, log_interval=10, is_incremental=False, mutual=False, mode=0, add_mapping=False, data_path='../model/model_saved/clkd/English', file_path='../model/model_saved/clkd', Tmodel_path='../model/model_saved/clkd/English/Tmodel/', lang='French', Tealang='English', t=1, data_path1='../model/model_saved/clkd/English', data_path2='../model/model_saved/clkd/French', lang1='English', lang2='French', e=0, mt=0.5, rd=0.1, is_static=False, graph_lang='English', tgtlang='French', days=7, initial_lang='French', TransLinear=True, tgt='English', embpath='../model/model_saved/clkd/dictrans/fr-en-for.npy', wordpath='../model/model_saved/clkd/dictrans/wordsFrench.txt')[source]

Bases: object

The CLKD (Contrastive Learning with Knowledge Distillation) model for social event detection.

Note

This detector uses contrastive learning and knowledge distillation to identify events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • n_epochs (int, optional) – Number of training epochs. Default: 1.

  • n_infer_epochs (int, optional) – Number of inference epochs. Default: 0.

  • window_size (int, optional) – Size of sliding window for incremental learning. Default: 3.

  • patience (int, optional) – Number of epochs to wait before early stopping. Default: 5.

  • margin (float, optional) – Margin for triplet loss. Default: 3.0.

  • lr (float, optional) – Learning rate. Default: 1e-3.

  • batch_size (int, optional) – Mini-batch size. Default: 2000.

  • n_neighbors (int, optional) – Number of neighbors for graph construction. Default: 800.

  • word_embedding_dim (int, optional) – Dimension of word embeddings. Default: 300.

  • hidden_dim (int, optional) – Hidden layer dimension. Default: 8.

  • out_dim (int, optional) – Output dimension. Default: 32.

  • num_heads (int, optional) – Number of attention heads. Default: 4.

  • use_residual (bool, optional) – Whether to use residual connections. Default: True.

  • validation_percent (float, optional) – Percentage of data for validation. Default: 0.1.

  • test_percent (float, optional) – Percentage of data for testing. Default: 0.2.

  • use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: False.

  • metrics (str, optional) – Evaluation metric to use. Default: 'ami'.

  • use_cuda (bool, optional) – Whether to use GPU acceleration. Default: False.

  • gpuid (int, optional) – ID of GPU to use. Default: 0.

  • mask_path (str, optional) – Path to attention mask file. Default: None.

  • log_interval (int, optional) – Number of steps between logging. Default: 10.

  • is_incremental (bool, optional) – Whether to use incremental learning. Default: False.

  • mutual (bool, optional) – Whether to use mutual learning. Default: False.

  • mode (int, optional) – Training mode. Default: 0.

  • add_mapping (bool, optional) – Whether to add mapping layer. Default: False.

  • data_path (str, optional) – Path to data directory. Default: '../model/model_saved/clkd/English'.

  • file_path (str, optional) – Path to save files. Default: '../model/model_saved/clkd'.

  • Tmodel_path (str, optional) – Path to teacher model. Default: '../model/model_saved/clkd/English/Tmodel/'.

  • lang (str, optional) – Language of the data. Default: 'French'.

  • Tealang (str, optional) – Language of teacher model. Default: 'English'.

  • t (float, optional) – Temperature parameter. Default: 1.

  • data_path1 (str, optional) – Path to first language data. Default: '../model/model_saved/clkd/English'.

  • data_path2 (str, optional) – Path to second language data. Default: '../model/model_saved/clkd/French'.

  • lang1 (str, optional) – First language. Default: 'English'.

  • lang2 (str, optional) – Second language. Default: 'French'.

  • e (float, optional) – Epsilon parameter. Default: 0.

  • mt (float, optional) – Momentum parameter. Default: 0.5.

  • rd (float, optional) – Random drop rate. Default: 0.1.

  • is_static (bool, optional) – Whether to use static embeddings. Default: False.

  • graph_lang (str, optional) – Language for graph construction. Default: 'English'.

  • tgtlang (str, optional) – Target language. Default: 'French'.

  • days (int, optional) – Number of days for temporal window. Default: 7.

  • initial_lang (str, optional) – Initial language. Default: 'French'.

  • TransLinear (bool, optional) – Whether to use linear transformation. Default: True.

  • tgt (str, optional) – Target language code. Default: 'English'.

  • embpath (str, optional) – Path to embedding file. Default: '../model/model_saved/clkd/dictrans/fr-en-for.npy'.

  • wordpath (str, optional) – Path to word dictionary. Default: '../model/model_saved/clkd/dictrans/wordsFrench.txt'.

preprocess()[source]
fit()[source]
detection()[source]
evaluate(predictions, ground_truths)[source]
class SocialED.detector.clkd.Preprocessor(args)[source]

Bases: object

generate_initial_features()[source]
documents_to_features(df, initial_lang)[source]
get_word2id_emb(wordpath, embpath)[source]
nonlinear_transform_features(wordpath, embpath, df)[source]
getlinear_transform_features(features, src, tgt)[source]
extract_time_feature(t_str)[source]
df_to_t_features(df)[source]
construct_graph()[source]
construct_graph_from_df(df, G=None)[source]
networkx_to_dgl_graph(G, save_path=None)[source]
construct_incremental_dataset(args, df, save_path, features, nfeatures, test=False)[source]
SocialED.detector.clkd.infer(train_i, i, data_split, metrics, embedding_save_path, loss_fn, model=None)[source]
SocialED.detector.clkd.mutual_infer(embedding_save_path1, embedding_save_path2, data_split1, data_split2, train_i, i, loss_fn, metrics, model1, model2, device)[source]
SocialED.detector.clkd.mutual_train(embedding_save_path1, embedding_save_path2, data_split1, data_split2, train_i, i, loss_fn, metrics, device)[source]
SocialED.detector.clkd.initial_maintain(train_i, i, data_split, metrics, embedding_save_path, loss_fn, model=None)[source]
SocialED.detector.clkd.generateMasks(length, data_split, train_i, i, validation_percent=0.1, test_percent=0.2, save_path=None)[source]
SocialED.detector.clkd.getdata(embedding_save_path, data_path, data_split, train_i, i, args, src=None, tgt=None)[source]
SocialED.detector.clkd.extract_embeddings(g, model, num_all_samples, labels, args, device)[source]
SocialED.detector.clkd.mutual_extract_embeddings(g, model, peer, src, tgt, num_all_samples, labels, args, device)[source]
SocialED.detector.clkd.save_embeddings(extract_nids, extract_features, extract_labels, extract_train_tags, path, counter)[source]
SocialED.detector.clkd.intersection(lst1, lst2)[source]
SocialED.detector.clkd.run_kmeans(extract_features, extract_labels, indices, metric, isoPath=None)[source]
SocialED.detector.clkd.evaluate_model(extract_features, extract_labels, indices, epoch, num_isolated_nodes, save_path, metrics, is_validation=True, file_name='evaluate.txt')[source]
class SocialED.detector.clkd.Metric[source]

Bases: object

reset()[source]
value()[source]
name()[source]
class SocialED.detector.clkd.AccumulatedAccuracyMetric[source]

Bases: Metric

Works with classification models.

reset()[source]
value()[source]
name()[source]
class SocialED.detector.clkd.AverageNonzeroTripletsMetric[source]

Bases: Metric

Counts average number of nonzero triplets found in minibatches

reset()[source]
value()[source]
name()[source]
class SocialED.detector.clkd.OnlineTripletLoss(*args: Any, **kwargs: Any)[source]

Bases: Module

Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated by a triplet_selector object that takes embeddings and targets and returns the indices of triplets.

forward(embeddings, target, rd, peer_embeddings=None)[source]
SocialED.detector.clkd.pdist(vectors)[source]
class SocialED.detector.clkd.TripletSelector[source]

Bases: object

Implementations should return the indices of anchor, positive, and negative samples as an np array of shape [N_triplets x 3].

get_triplets(embeddings, labels)[source]
class SocialED.detector.clkd.FunctionNegativeTripletSelector(margin, negative_selection_fn, cpu=True)[source]

Bases: TripletSelector

For each positive pair, takes the hardest negative sample (with the greatest triplet loss value) to create a triplet. Margin should match the margin used in triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples, and return a negative index for that pair.

get_triplets(embeddings, labels)[source]
SocialED.detector.clkd.random_hard_negative(loss_values)[source]
SocialED.detector.clkd.HardestNegativeTripletSelector(margin, cpu=False)[source]
SocialED.detector.clkd.RandomNegativeTripletSelector(margin, cpu=False)[source]
class SocialED.detector.clkd.GATLayer(*args: Any, **kwargs: Any)[source]

Bases: Module

reset_parameters()[source]

Reinitialize learnable parameters.

edge_attention(edges)[source]
message_func(edges)[source]
reduce_func(nodes)[source]
forward(blocks, layer_id)[source]
class SocialED.detector.clkd.MultiHeadGATLayer(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(blocks, layer_id)[source]
class SocialED.detector.clkd.GAT(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(blocks, args, trans=False, src=None, tgt=None)[source]
class SocialED.detector.clkd.Arabic_preprocessor(tokenizer, **cfg)[source]

Bases: object

clean_text(text)[source]
class SocialED.detector.clkd.SocialDataset(*args: Any, **kwargs: Any)[source]

Bases: Dataset

load_adj_matrix(path, index)[source]
remove_obsolete_nodes(indices_to_remove=None)[source]
SocialED.detector.clkd.graph_statistics(G, save_path)[source]

KPGNN

class SocialED.detector.kpgnn.KPGNN(dataset, n_epochs=15, n_infer_epochs=0, window_size=3, patience=5, margin=3.0, lr=0.001, batch_size=200, n_neighbors=800, hidden_dim=8, out_dim=32, num_heads=4, use_residual=True, validation_percent=0.2, use_hardest_neg=False, use_dgi=False, remove_obsolete=2, is_incremental=False, use_cuda=False, data_path='../model/model_saved/kpgnn/kpgnn_incremental_test', mask_path=None, resume_path=None, resume_point=0, resume_current=True, log_interval=10)[source]

Bases: object

The KPGNN model for social event detection that uses knowledge-preserving graph neural networks for event detection.

Note

This detector uses graph neural networks with knowledge preservation to identify events in social media data. The model requires a dataset object with a load_data() method.

See [4] for details.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • n_epochs (int, optional) – Number of training epochs. Default: 15.

  • n_infer_epochs (int, optional) – Number of inference epochs. Default: 0.

  • window_size (int, optional) – Size of sliding window. Default: 3.

  • patience (int, optional) – Early stopping patience. Default: 5.

  • margin (float, optional) – Margin for triplet loss. Default: 3.0.

  • lr (float, optional) – Learning rate for optimizer. Default: 1e-3.

  • batch_size (int, optional) – Batch size for training. Default: 200.

  • n_neighbors (int, optional) – Number of neighbors to sample. Default: 800.

  • hidden_dim (int, optional) – Hidden layer dimension. Default: 8.

  • out_dim (int, optional) – Output dimension. Default: 32.

  • num_heads (int, optional) – Number of attention heads. Default: 4.

  • use_residual (bool, optional) – Whether to use residual connections. Default: True.

  • validation_percent (float, optional) – Percentage of data for validation. Default: 0.2.

  • use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: False.

  • use_dgi (bool, optional) – Whether to use deep graph infomax. Default: False.

  • remove_obsolete (int, optional) – Mode for handling obsolete data (0, 1, or 2; see generateMasks). Default: 2.

  • is_incremental (bool, optional) – Whether to use incremental learning. Default: False.

  • use_cuda (bool, optional) – Whether to use GPU acceleration. Default: False.

  • data_path (str, optional) – Path to save model data. Default: '../model/model_saved/kpgnn/kpgnn_incremental_test'.

  • mask_path (str, optional) – Path to mask file. Default: None.

  • resume_path (str, optional) – Path to resume training from. Default: None.

  • resume_point (int, optional) – Epoch to resume from. Default: 0.

  • resume_current (bool, optional) – Whether to resume from current state. Default: True.

  • log_interval (int, optional) – Number of steps between logging. Default: 10.

preprocess()[source]
fit()[source]
detection()[source]
evaluate(predictions, ground_truths)[source]
class SocialED.detector.kpgnn.Preprocessor(dataset)[source]

Bases: object

generate_initial_features(dataset)[source]
documents_to_features(df)[source]
extract_time_feature(t_str)[source]
df_to_t_features(df)[source]
custom_message_graph(dataset)[source]
construct_graph_from_df(df, G=None)[source]
networkx_to_dgl_graph(G, save_path=None)[source]
construct_incremental_dataset(df, save_path, features, test=True)[source]
class SocialED.detector.kpgnn.KPGNN_model(args)[source]

Bases: object

infer(train_i, i, data_split, metrics, embedding_save_path, loss_fn, train_indices=None, model=None, loss_fn_dgi=None, indices_to_remove=[])[source]
initial_maintain(train_i, i, data_split, metrics, embedding_save_path, loss_fn, model=None, loss_fn_dgi=None)[source]
SocialED.detector.kpgnn.graph_statistics(G, save_path)[source]
SocialED.detector.kpgnn.generateMasks(length, data_split, train_i, i, validation_percent=0.2, save_path=None, num_indices_to_remove=0)[source]

Generates train and validation indices for initial/maintenance epochs, and test indices for inference (prediction) epochs.

If remove_obsolete mode is 0 or 1:

  • For initial/maintenance epochs: the first (train_i + 1) blocks (blocks 0, …, train_i) are used as the training set (with explicit labels), and validation_percent of the training indices are randomly sampled as validation indices.

  • For inference (prediction) epochs: the (i + 1)th block (block i) is used as the test set. Note that the other blocks (blocks train_i + 1, …, i - 1) are also in the graph (without explicit labels; only their features and structural info are leveraged).

If remove_obsolete mode is 2:

  • For initial/maintenance epochs: the (i + 1) = (train_i + 1)th block (block train_i = i) is used as the training set (with explicit labels), and validation_percent of the training indices are randomly sampled as validation indices.

  • For inference (prediction) epochs: the (i + 1)th block (block i) is used as the test set.

Parameters:
  • length – the length of label list

  • data_split – loaded split data (generated in custom_message_graph.py)

  • train_i, i – flags; the initial/maintenance stage if train_i == i, and the inference stage otherwise

  • validation_percent – the percentage of the whole dataset used as validation data

  • save_path – path to save data

  • num_indices_to_remove – number of indices to remove

Returns: train indices, validation indices, or test indices
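
The block-based splitting described for generateMasks can be sketched as follows. This is a simplified illustration, not the library function: it returns global positions in the concatenated label list, ignores save_path and num_indices_to_remove, and introduces hypothetical mode and seed arguments (mode standing in for the remove_obsolete setting).

```python
import random


def generate_masks(data_split, train_i, i, validation_percent=0.2, mode=0, seed=0):
    """Return index lists for one stage; data_split holds the size of each block."""
    # offsets[k] is the position where block k starts.
    offsets = [0]
    for size in data_split:
        offsets.append(offsets[-1] + size)

    if train_i == i:
        # Initial/maintenance stage: modes 0 and 1 train on blocks 0..train_i,
        # mode 2 trains on block i only.
        start = offsets[i] if mode == 2 else 0
        indices = list(range(start, offsets[i + 1]))
        random.Random(seed).shuffle(indices)
        n_val = int(len(indices) * validation_percent)
        return {
            "train": sorted(indices[n_val:]),
            "validation": sorted(indices[:n_val]),
        }

    # Inference stage: block i is the test set.
    return {"test": list(range(offsets[i], offsets[i + 1]))}


data_split = [10, 5, 5]
initial = generate_masks(data_split, train_i=0, i=0)
inference = generate_masks(data_split, train_i=0, i=2)
print(len(initial["train"]), len(initial["validation"]))  # 8 2
print(inference["test"])  # [15, 16, 17, 18, 19]
```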

SocialED.detector.kpgnn.extract_embeddings(g, model, num_all_samples, labels)[source]
SocialED.detector.kpgnn.save_embeddings(extract_nids, extract_features, extract_labels, extract_train_tags, path, counter)[source]
SocialED.detector.kpgnn.intersection(lst1, lst2)[source]
SocialED.detector.kpgnn.run_kmeans(extract_features, extract_labels, indices, isoPath=None)[source]
SocialED.detector.kpgnn.evaluate_model(extract_features, extract_labels, indices, epoch, num_isolated_nodes, save_path, is_validation=True)[source]
class SocialED.detector.kpgnn.Metric[source]

Bases: object

reset()[source]
value()[source]
name()[source]
class SocialED.detector.kpgnn.AccumulatedAccuracyMetric[source]

Bases: Metric

Works with classification model

reset()[source]
value()[source]
name()[source]
class SocialED.detector.kpgnn.AverageNonzeroTripletsMetric[source]

Bases: Metric

Counts average number of nonzero triplets found in minibatches

reset()[source]
value()[source]
name()[source]
class SocialED.detector.kpgnn.GATLayer(*args: Any, **kwargs: Any)[source]

Bases: Module

reset_parameters()[source]

Reinitialize learnable parameters.

edge_attention(edges)[source]
message_func(edges)[source]
reduce_func(nodes)[source]
forward(block)[source]
class SocialED.detector.kpgnn.MultiHeadGATLayer(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(block)[source]
class SocialED.detector.kpgnn.GAT(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(blocks, features)[source]
class SocialED.detector.kpgnn.AvgReadout(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(seq)[source]
class SocialED.detector.kpgnn.Discriminator(*args: Any, **kwargs: Any)[source]

Bases: Module

weights_init(m)[source]
forward(c, h_pl, h_mi, s_bias1=None, s_bias2=None)[source]
class SocialED.detector.kpgnn.DGI(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(nf)[source]
embed(nf)[source]
class SocialED.detector.kpgnn.OnlineTripletLoss(*args: Any, **kwargs: Any)[source]

Bases: Module

Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated by a triplet_selector object that takes embeddings and targets and returns the indices of triplets.

forward(embeddings, target)[source]
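As an illustration of what a triplet loss computes once triplet indices are available, here is a minimal plain-Python sketch. It assumes squared Euclidean distance and mean reduction; the names squared_distance and triplet_loss_sketch are hypothetical, and the library's forward() may differ in both respects.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss_sketch(embeddings, triplets, margin):
    """Mean of max(0, d(a, p) - d(a, n) + margin) over index triplets.

    `triplets` is a list of (anchor, positive, negative) indices into
    `embeddings`. The distance and reduction used here are assumptions.
    """
    losses = []
    for a, p, n in triplets:
        d_ap = squared_distance(embeddings[a], embeddings[p])
        d_an = squared_distance(embeddings[a], embeddings[n])
        # A triplet contributes only when the negative is not at least
        # `margin` farther from the anchor than the positive.
        losses.append(max(0.0, d_ap - d_an + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

A triplet whose negative already lies far enough from the anchor contributes zero, which is why selectors that pick "hard" negatives matter for training.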
SocialED.detector.kpgnn.pdist(vectors)[source]
class SocialED.detector.kpgnn.TripletSelector[source]

Bases: object

Implementations should return the indices of anchor, positive, and negative samples as an np array of shape [N_triplets x 3].

get_triplets(embeddings, labels)[source]
class SocialED.detector.kpgnn.FunctionNegativeTripletSelector(margin, negative_selection_fn, cpu=True)[source]

Bases: TripletSelector

For each positive pair, takes the hardest negative sample (the one with the greatest triplet loss value) to create a triplet. margin should match the margin used in the triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples, and return a negative index for that pair.

get_triplets(embeddings, labels)[source]
SocialED.detector.kpgnn.random_hard_negative(loss_values)[source]
SocialED.detector.kpgnn.hardest_negative(loss_values)[source]
SocialED.detector.kpgnn.HardestNegativeTripletSelector(margin, cpu=False)[source]
SocialED.detector.kpgnn.RandomNegativeTripletSelector(margin, cpu=False)[source]
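The contract of negative_selection_fn (take the loss values of one anchor-positive pair against all negatives, return one negative index) can be sketched as follows. These are hypothetical stand-ins written in plain Python, not the library's hardest_negative and random_hard_negative; returning None when no negative has a positive loss is an assumption.

```python
import random


def hardest_negative_sketch(loss_values):
    """Index of the negative with the largest loss, or None if no loss is positive."""
    if not loss_values:
        return None
    best = max(range(len(loss_values)), key=lambda k: loss_values[k])
    return best if loss_values[best] > 0 else None


def random_hard_negative_sketch(loss_values, rng=random):
    """Index of a randomly chosen negative with positive loss, or None."""
    hard = [k for k, value in enumerate(loss_values) if value > 0]
    return rng.choice(hard) if hard else None
```

Either function could be passed wherever a negative_selection_fn is expected, provided the caller handles a None result by skipping that anchor-positive pair.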
class SocialED.detector.kpgnn.SocialDataset(*args: Any, **kwargs: Any)[source]

Bases: Dataset

load_adj_matrix(path, index)[source]
remove_obsolete_nodes(indices_to_remove=None)[source]

FinEvent

class SocialED.detector.finevent.FinEvent(dataset, n_epochs=1, window_size=3, patience=5, margin=3.0, lr=0.001, batch_size=50, hidden_dim=128, out_dim=64, heads=4, validation_percent=0.2, use_hardest_neg=False, is_shared=False, inter_opt='cat_w_avg', is_initial=True, sampler='RL_sampler', cluster_type='kmeans', threshold_start0=[[0.2], [0.2], [0.2]], RL_step0=0.02, RL_start0=0, eps_start=0.001, eps_step=0.02, min_Pts_start=2, min_Pts_step=1, use_cuda=True, data_path='../model/model_saved/finevent/incremental_test/', file_path='../model/model_saved/finevent/', mask_path=None, log_interval=10)[source]

Bases: object

The FinEvent model for social event detection that uses graph neural networks and reinforcement learning for adaptive event detection.

Note

This detector uses graph neural networks and reinforcement learning to identify events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • n_epochs (int, optional) – Number of training epochs. Default: 1.

  • window_size (int, optional) – Size of sliding window for incremental learning. Default: 3.

  • patience (int, optional) – Number of epochs to wait before early stopping. Default: 5.

  • margin (float, optional) – Margin for triplet loss. Default: 3.0.

  • lr (float, optional) – Learning rate. Default: 1e-3.

  • batch_size (int, optional) – Mini-batch size. Default: 50.

  • hidden_dim (int, optional) – Hidden layer dimension. Default: 128.

  • out_dim (int, optional) – Output dimension. Default: 64.

  • heads (int, optional) – Number of attention heads. Default: 4.

  • validation_percent (float, optional) – Percentage of data for validation. Default: 0.2.

  • use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: False.

  • is_shared (bool, optional) – Whether to use shared parameters. Default: False.

  • inter_opt (str, optional) – Integration option for multi-view features. Default: 'cat_w_avg'.

  • is_initial (bool, optional) – Whether to initialize model. Default: True.

  • sampler (str, optional) – Type of sampler to use. Default: 'RL_sampler'.

  • cluster_type (str, optional) – Clustering algorithm to use. Default: 'kmeans'.

  • threshold_start0 (list, optional) – Initial thresholds for RL-0. Default: [[0.2], [0.2], [0.2]].

  • RL_step0 (float, optional) – Step size for RL-0. Default: 0.02.

  • RL_start0 (int, optional) – Starting point for RL-0. Default: 0.

  • eps_start (float, optional) – Initial epsilon for RL-1. Default: 0.001.

  • eps_step (float, optional) – Step size for epsilon in RL-1. Default: 0.02.

  • min_Pts_start (int, optional) – Initial minimum points for RL-1. Default: 2.

  • min_Pts_step (int, optional) – Step size for minimum points in RL-1. Default: 1.

  • use_cuda (bool, optional) – Whether to use GPU acceleration. Default: True.

  • data_path (str, optional) – Path to data directory. Default: '../model/model_saved/finevent/incremental_test/'.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/finevent/'.

  • mask_path (str, optional) – Path to attention mask file. Default: None.

  • log_interval (int, optional) – Number of steps between logging. Default: 10.

preprocess()[source]
fit()[source]
detection()[source]
evaluate(predictions, ground_truths)[source]
class SocialED.detector.finevent.FinEvent_model(args)[source]

Bases: FinEvent

inference(train_i, i, metrics, embedding_save_path, loss_fn, model, RL_thresholds=None, loss_fn_dgi=None)[source]
initial_maintain(train_i, i, metrics, embedding_save_path, loss_fn, model=None, loss_fn_dgi=None)[source]
Parameters:
  • i

  • data_split

  • metrics

  • embedding_save_path

  • loss_fn

  • model

  • loss_fn_dgi

Returns:
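The train_i and i arguments of inference and initial_maintain identify the last block used for training and the block currently being processed. One plausible way to drive them over a stream of blocks, assuming the model is maintained every window_size blocks, is sketched below. The function block_schedule and the exact maintenance rule are assumptions, not taken from the library.

```python
def block_schedule(num_blocks, window_size):
    """Yield (stage, train_i, i) tuples for a stream of message blocks.

    Assumption: block 0 is used for initial training, every later block is
    first used for inference, and the model is maintained whenever
    i is a multiple of window_size.
    """
    train_i = 0
    yield ("initial_maintain", train_i, 0)
    for i in range(1, num_blocks):
        # Inference stage: train_i != i, block i is the test set.
        yield ("inference", train_i, i)
        if i % window_size == 0:
            # Maintenance stage: train_i == i from here on.
            train_i = i
            yield ("initial_maintain", train_i, i)
```

For five blocks and window_size=3, this yields training on block 0, inference on blocks 1 to 3, maintenance on block 3, and inference on block 4.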

class SocialED.detector.finevent.Preprocessor[source]

Bases: object

documents_to_features(df)[source]
extract_time_feature(t_str)[source]
df_to_t_features(df)[source]
generate_initial_features(df, save_path='../model/model_saved/finevent/')[source]
construct_graph_from_df(df, G=None)[source]
construct_incremental_dataset(df, save_path, features, test=True)[source]
construct_graph(df, save_path='../model/model_saved/finevent/incremental_test/')[source]
networkx_to_dgl_graph(G, save_path=None)[source]
save_edge_index(data_path='../model/model_saved/finevent/incremental_test')[source]
SocialED.detector.finevent.sparse_trans(datapath='incremental_test/0/s_m_tid_userid_tid.npz')[source]
SocialED.detector.finevent.coo_trans(datapath='incremental_test/0/s_m_tid_userid_tid.npz')[source]
SocialED.detector.finevent.create_dataset(loadpath, relation, mode)[source]
SocialED.detector.finevent.create_homodataset(loadpath, mode, valid_percent=0.2)[source]
SocialED.detector.finevent.create_offline_homodataset(loadpath, mode)[source]
SocialED.detector.finevent.create_multi_relational_graph(loadpath, relations, mode)[source]
SocialED.detector.finevent.save_multi_relational_graph(loadpath, relations, mode)[source]
SocialED.detector.finevent.intersection(lst1, lst2)[source]
SocialED.detector.finevent.run_hdbscan(extract_features, extract_labels, indices, is_validation, isoPath=None)[source]
SocialED.detector.finevent.run_kmeans(extract_features, extract_labels, indices, isoPath=None)[source]
SocialED.detector.finevent.evaluate_model(extract_features, extract_labels, indices, epoch, num_isolated_nodes, save_path, is_validation=True, cluster_type='kmeans')[source]
SocialED.detector.finevent.generateMasks(length, data_split, train_i, i, validation_percent=0.2, save_path=None, remove_obsolete=2)[source]

Generates train and validation indices for initial/maintenance epochs, and test indices for inference (prediction) epochs.

If remove_obsolete mode is 0 or 1:

For initial/maintenance epochs:
- The first (train_i + 1) blocks (blocks 0, …, train_i) are used as the training set (with explicit labels).
- validation_percent of the training indices are randomly sampled as validation indices.

For inference (prediction) epochs:
- The (i + 1)th block (block i) is used as the test set.

Note that the other blocks (blocks train_i + 1, …, i - 1) are also in the graph (without explicit labels; only their features and structural information are leveraged).

If remove_obsolete mode is 2:

For initial/maintenance epochs:
- The (i + 1) = (train_i + 1)th block (block train_i = i) is used as the training set (with explicit labels).
- validation_percent of the training indices are randomly sampled as validation indices.

For inference (prediction) epochs:
- The (i + 1)th block (block i) is used as the test set.

Parameters:
  • length – the length of label list

  • data_split – loaded split data (generated in custom_message_graph.py)

  • train_i, i – flags: train_i == i indicates the initial/maintenance stage; any other combination indicates the inference stage

  • validation_percent – the fraction of the training indices sampled as validation indices

  • save_path – path to save data

  • remove_obsolete – obsolete-node removal mode (0, 1, or 2), as described above. Default: 2.

Returns:
  train indices and validation indices, or test indices

SocialED.detector.finevent.gen_offline_masks(length, validation_percent=0.2, test_percent=0.1)[source]
SocialED.detector.finevent.save_embeddings(extracted_features, save_path)[source]
class SocialED.detector.finevent.MySampler(sampler)[source]

Bases: object

sample(multi_relational_edge_index: List[torch.functional.Tensor], node_idx, sizes, batch_size)[source]
class SocialED.detector.finevent.Metric[source]

Bases: object

reset()[source]
value()[source]
name()[source]
class SocialED.detector.finevent.AccumulatedAccuracyMetric[source]

Bases: Metric

Works with classification model

reset()[source]
value()[source]
name()[source]
class SocialED.detector.finevent.AverageNonzeroTripletsMetric[source]

Bases: Metric

Counts average number of nonzero triplets found in minibatches

reset()[source]
value()[source]
name()[source]
class SocialED.detector.finevent.MarGNN(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(x, adjs, n_ids, device, RL_thresholds)[source]
SocialED.detector.finevent.RL_neighbor_filter_full(multi_r_data, RL_thresholds, features, save_path=None)[source]
SocialED.detector.finevent.multi_forward_agg(args, foward_args, iter_epoch)[source]
class SocialED.detector.finevent.GAT(*args: Any, **kwargs: Any)[source]

Bases: Module

Adopt this module when using mini-batches.

forward(x, adjs, device)[source]
class SocialED.detector.finevent.Intra_AGG(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(x, adjs, device)[source]
class SocialED.detector.finevent.Inter_AGG(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(features, thresholds, inter_opt)[source]
SocialED.detector.finevent.pre_node_dist(multi_r_data, features, save_path=None)[source]

Precomputes the similarity between each node and its neighbors, so that it does not have to be recomputed repeatedly.

Parameters:
  • multi_r_data

  • features

  • save_path (optional) – Defaults to None.
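The idea of computing node-neighbor similarities once and reusing them can be sketched as a simple cache. The example below is an assumption-laden illustration: it uses Euclidean distance as the measure, a dictionary of adjacency lists per relation as the input layout, and a hypothetical function name; none of these is confirmed by the pre_node_dist documentation.

```python
import math


def precompute_neighbor_distances(features, neighbors_by_relation):
    """Map relation -> {node: [(neighbor, distance), ...]}, computed once.

    `features` maps node id -> feature vector; `neighbors_by_relation` maps
    relation name -> {node id: [neighbor ids]}. Both layouts are hypothetical.
    """
    cache = {}
    for relation, adjacency in neighbors_by_relation.items():
        per_node = {}
        for node, neighbors in adjacency.items():
            # Euclidean distance is an assumed stand-in for the similarity measure.
            per_node[node] = [
                (nb, math.dist(features[node], features[nb])) for nb in neighbors
            ]
        cache[relation] = per_node
    return cache
```

A later filtering step can then compare the cached distances against a per-relation threshold without touching the feature vectors again.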

SocialED.detector.finevent.RL_neighbor_filter(args, multi_r_data, RL_thresholds, load_path)[source]
class SocialED.detector.finevent.AvgReadout(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(seq)[source]
class SocialED.detector.finevent.Discriminator(*args: Any, **kwargs: Any)[source]

Bases: Module

weights_init(m)[source]
forward(c, h_pl, h_mi, s_bias1=None, s_bias2=None)[source]
class SocialED.detector.finevent.OnlineTripletLoss(*args: Any, **kwargs: Any)[source]

Bases: Module

Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated by a triplet_selector object that takes embeddings and targets and returns the indices of triplets.

forward(embeddings, target)[source]
SocialED.detector.finevent.pdist(vectors)[source]
class SocialED.detector.finevent.TripletSelector[source]

Bases: object

Implementations should return the indices of anchor, positive, and negative samples as an np array of shape [N_triplets x 3].

get_triplets(embeddings, labels)[source]
class SocialED.detector.finevent.FunctionNegativeTripletSelector(margin, negative_selection_fn, cpu=True)[source]

Bases: TripletSelector

For each positive pair, takes the hardest negative sample (the one with the greatest triplet loss value) to create a triplet. margin should match the margin used in the triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples, and return a negative index for that pair.

get_triplets(embeddings, labels)[source]
SocialED.detector.finevent.random_hard_negative(loss_values)[source]
SocialED.detector.finevent.hardest_negative(loss_values)[source]
SocialED.detector.finevent.HardestNegativeTripletSelector(margin, cpu=False)[source]
SocialED.detector.finevent.RandomNegativeTripletSelector(margin, cpu=False)[source]

QSGNN

class SocialED.detector.qsgnn.QSGNN(dataset, finetune_epochs=1, n_epochs=5, oldnum=20, novelnum=20, n_infer_epochs=0, window_size=3, patience=5, margin=3.0, a=8.0, lr=0.001, batch_size=1000, n_neighbors=1200, word_embedding_dim=300, hidden_dim=16, out_dim=64, num_heads=4, use_residual=True, validation_percent=0.1, test_percent=0.2, use_hardest_neg=True, metrics='nmi', use_cuda=True, add_ort=True, gpuid=0, mask_path=None, log_interval=10, is_incremental=True, data_path='../model/model_saved/qsgnn/English', file_path='../model/model_saved/qsgnn', add_pair=False, initial_lang='English', is_static=False, graph_lang='English', days=2)[source]

Bases: object

The QSGNN model for social event detection that uses a query-based streaming graph neural network for event detection.

Note

This detector uses graph neural networks with query-based streaming to identify events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • finetune_epochs (int, optional) – Number of fine-tuning epochs. Default: 1.

  • n_epochs (int, optional) – Number of training epochs. Default: 5.

  • oldnum (int, optional) – Number of old classes. Default: 20.

  • novelnum (int, optional) – Number of novel classes. Default: 20.

  • n_infer_epochs (int, optional) – Number of inference epochs. Default: 0.

  • window_size (int, optional) – Size of sliding window. Default: 3.

  • patience (int, optional) – Early stopping patience. Default: 5.

  • margin (float, optional) – Margin for triplet loss. Default: 3.0.

  • a (float, optional) – Scaling factor. Default: 8.0.

  • lr (float, optional) – Learning rate for optimizer. Default: 1e-3.

  • batch_size (int, optional) – Batch size for training. Default: 1000.

  • n_neighbors (int, optional) – Number of neighbors to sample. Default: 1200.

  • word_embedding_dim (int, optional) – Word embedding dimension. Default: 300.

  • hidden_dim (int, optional) – Hidden layer dimension. Default: 16.

  • out_dim (int, optional) – Output dimension. Default: 64.

  • num_heads (int, optional) – Number of attention heads. Default: 4.

  • use_residual (bool, optional) – Whether to use residual connections. Default: True.

  • validation_percent (float, optional) – Percentage of data for validation. Default: 0.1.

  • test_percent (float, optional) – Percentage of data for testing. Default: 0.2.

  • use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: True.

  • metrics (str, optional) – Evaluation metric to use. Default: 'nmi'.

  • use_cuda (bool, optional) – Whether to use GPU acceleration. Default: True.

  • add_ort (bool, optional) – Whether to add orthogonal regularization. Default: True.

  • gpuid (int, optional) – GPU device ID to use. Default: 0.

  • mask_path (str, optional) – Path to mask file. Default: None.

  • log_interval (int, optional) – Number of steps between logging. Default: 10.

  • is_incremental (bool, optional) – Whether to use incremental learning. Default: True.

  • data_path (str, optional) – Path to save model data. Default: '../model/model_saved/qsgnn/English'.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/qsgnn'.

  • add_pair (bool, optional) – Whether to add pair-wise constraints. Default: False.

  • initial_lang (str, optional) – Initial language for processing. Default: 'English'.

  • is_static (bool, optional) – Whether to use static graph. Default: False.

  • graph_lang (str, optional) – Language for graph construction. Default: 'English'.

  • days (int, optional) – Number of days for temporal window. Default: 2.

preprocess()[source]
fit()[source]
detection()[source]
evaluate(predictions, ground_truths)[source]
class SocialED.detector.qsgnn.Preprocessor(dataser)[source]

Bases: QSGNN

generate_initial_features(dataset)[source]
documents_to_features(df, initial_lang)[source]
extract_time_feature(t_str)[source]
df_to_t_features(df)[source]
construct_graph()[source]
construct_graph_from_df(df, G=None)[source]
networkx_to_dgl_graph(G, save_path=None)[source]
construct_incremental_dataset(args, df, save_path, features, test=False)[source]
class SocialED.detector.qsgnn.SocialDataset(*args: Any, **kwargs: Any)[source]

Bases: Dataset

load_adj_matrix(path, index)[source]
remove_obsolete_nodes(indices_to_remove=None)[source]
class SocialED.detector.qsgnn.Arabic_preprocessor(tokenizer, **cfg)[source]

Bases: object

clean_text(text)[source]
class SocialED.detector.qsgnn.EDNN(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(x)[source]
class SocialED.detector.qsgnn.simNN(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(x)[source]
class SocialED.detector.qsgnn.GATLayer(*args: Any, **kwargs: Any)[source]

Bases: Module

reset_parameters()[source]

Reinitialize learnable parameters.

edge_attention(edges)[source]
message_func(edges)[source]
reduce_func(nodes)[source]
forward(blocks, layer_id)[source]
class SocialED.detector.qsgnn.MultiHeadGATLayer(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(blocks, layer_id)[source]
class SocialED.detector.qsgnn.GAT(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(blocks)[source]
SocialED.detector.qsgnn.graph_statistics(G, save_path)[source]
SocialED.detector.qsgnn.generateMasks(length, data_split, i, validation_percent=0.2, test_percent=0.2, save_path=None)[source]
SocialED.detector.qsgnn.getdata(embedding_save_path, data_split, i, args)[source]
SocialED.detector.qsgnn.intersection(lst1, lst2)[source]
SocialED.detector.qsgnn.run_kmeans(extract_features, extract_labels, indices, args, isoPath=None)[source]
SocialED.detector.qsgnn.evaluate(extract_features, extract_labels, indices, epoch, num_isolated_nodes, save_path, args, is_validation=True)[source]
SocialED.detector.qsgnn.extract_embeddings(g, model, num_all_samples, args)[source]
SocialED.detector.qsgnn.initial_train(i, args, data_split, metrics, embedding_save_path, loss_fn, model=None)[source]
SocialED.detector.qsgnn.continue_train(i, data_split, metrics, embedding_save_path, loss_fn, model, label_center_emb, args)[source]
class SocialED.detector.qsgnn.OnlineTripletLoss(*args: Any, **kwargs: Any)[source]

Bases: Module

Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated by a triplet_selector object that takes embeddings and targets and returns the indices of triplets.

forward(embeddings, target)[source]
SocialED.detector.qsgnn.pdist(vectors)[source]
class SocialED.detector.qsgnn.TripletSelector[source]

Bases: object

Implementations should return the indices of anchor, positive, and negative samples as an np array of shape [N_triplets x 3].

get_triplets(embeddings, labels)[source]
class SocialED.detector.qsgnn.FunctionNegativeTripletSelector(margin, negative_selection_fn, cpu=True)[source]

Bases: TripletSelector

For each positive pair, takes the hardest negative sample (the one with the greatest triplet loss value) to create a triplet. margin should match the margin used in the triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples, and return a negative index for that pair.

get_triplets(embeddings, labels)[source]
SocialED.detector.qsgnn.print_scores(scores)[source]
SocialED.detector.qsgnn.random_hard_negative(loss_values)[source]
SocialED.detector.qsgnn.hardest_negative(loss_values)[source]
SocialED.detector.qsgnn.HardestNegativeTripletSelector(margin, cpu=False)[source]
SocialED.detector.qsgnn.RandomNegativeTripletSelector(margin, cpu=False)[source]
SocialED.detector.qsgnn.relu_evidence(y)[source]
SocialED.detector.qsgnn.exp_evidence(y)[source]
SocialED.detector.qsgnn.softplus_evidence(y)[source]
SocialED.detector.qsgnn.kl_divergence(alpha, num_classes, device)[source]
SocialED.detector.qsgnn.loglikelihood_loss(y, alpha, device)[source]
SocialED.detector.qsgnn.mse_loss(y, alpha, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.qsgnn.edl_loss(func, y, alpha, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.qsgnn.edl_mse_loss(alpha, target, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.qsgnn.edl_log_loss(alpha, target, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.qsgnn.edl_digamma_loss(alpha, target, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.qsgnn.pairwise_sample(embeddings, labels=None, model=None)[source]
class SocialED.detector.qsgnn.Metric[source]

Bases: object

reset()[source]
value()[source]
name()[source]
class SocialED.detector.qsgnn.AccumulatedAccuracyMetric[source]

Bases: Metric

Works with classification model

reset()[source]
value()[source]
name()[source]
class SocialED.detector.qsgnn.AverageNonzeroTripletsMetric[source]

Bases: Metric

Counts average number of nonzero triplets found in minibatches

reset()[source]
value()[source]
name()[source]

HCRC

SocialED.detector.hcrc.currentTime()[source]
class SocialED.detector.hcrc.HCRC(dataset, file_path: str = '../model/model_saved/hcrc/', result_path: str = '../model/model_saved/hcrc/res.txt', task: str = 'DRL', layers: str = '[256]', N_pred_hid: int = 64, G_pred_hid: int = 16, eval_freq: float = 5, mad: float = 0.9, Glr: float = 6e-07, Nlr: float = 1e-05, Ges: int = 50, Nes: int = 2000, Gepochs: int = 105, Nepochs: int = 100, device: int = 0)[source]

Bases: object

The HCRC model for social event detection that uses hierarchical clustering and reinforcement learning for adaptive event detection.

Note

This detector uses hierarchical clustering and reinforcement learning to adaptively detect events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/hcrc/'.

  • result_path (str, optional) – Path to save results file. Default: '../model/model_saved/hcrc/res.txt'.

  • task (str, optional) – Task type, e.g. ‘DRL’ for deep reinforcement learning. Default: 'DRL'.

  • layers (str, optional) – Hidden layer dimensions as string. Default: '[256]'.

  • N_pred_hid (int, optional) – Node prediction hidden dimension. Default: 64.

  • G_pred_hid (int, optional) – Graph prediction hidden dimension. Default: 16.

  • eval_freq (float, optional) – Evaluation frequency. Default: 5.

  • mad (float, optional) – Moving average decay rate. Default: 0.9.

  • Glr (float, optional) – Learning rate for graph model. Default: 0.0000006.

  • Nlr (float, optional) – Learning rate for node model. Default: 0.00001.

  • Ges (int, optional) – Graph model early stopping patience. Default: 50.

  • Nes (int, optional) – Node model early stopping patience. Default: 2000.

  • Gepochs (int, optional) – Number of graph model training epochs. Default: 105.

  • Nepochs (int, optional) – Number of node model training epochs. Default: 100.

  • device (int, optional) – GPU device ID to use. Default: 0.

fit()[source]
detection()[source]
evaluate(predictions, ground_truths)[source]
class SocialED.detector.hcrc.EMA(beta, epochs)[source]

Bases: object

update_average(old, new)[source]
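An exponential moving average update with a decay rate such as mad can be sketched as below. This is a generic illustration with a fixed decay: EmaSketch is a hypothetical class, it ignores the epochs argument of the documented constructor, and the library's update_average may schedule beta differently.

```python
class EmaSketch:
    """Exponential moving average with a fixed decay rate (assumed behavior)."""

    def __init__(self, beta):
        self.beta = beta  # decay rate, e.g. 0.9

    def update_average(self, old, new):
        # With no previous average, start from the new value.
        if old is None:
            return new
        # Keep a fraction beta of the old value and blend in the rest.
        return old * self.beta + (1 - self.beta) * new
```

With beta = 0.9, each update keeps 90% of the previous average, so the averaged parameters change slowly relative to the current model.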
SocialED.detector.hcrc.get_task(strs)[source]
SocialED.detector.hcrc.init_weights(m)[source]
SocialED.detector.hcrc.sim(z1, z2)[source]
SocialED.detector.hcrc.semi_loss(z1, z2)[source]
SocialED.detector.hcrc.get_loss(h1, h2)[source]
SocialED.detector.hcrc.update_moving_average(ema_updater, ma_model, current_model)[source]
SocialED.detector.hcrc.set_requires_grad(model, val)[source]
SocialED.detector.hcrc.enumerateConfig(args)[source]
SocialED.detector.hcrc.config2string(args)[source]
SocialED.detector.hcrc.printConfig(args)[source]
class SocialED.detector.hcrc.embedder(args)[source]

Bases: object

class SocialED.detector.hcrc.Encoder(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(x, edge_index)[source]
SocialED.detector.hcrc.DRL_cluster(all_embeddings, block_num, pred_label)[source]
SocialED.detector.hcrc.random_cluster(all_embeddings, block_num, pred_label)[source]
SocialED.detector.hcrc.semi_cluster(all_embeddings, label, block_num, pred_label)[source]
SocialED.detector.hcrc.NMI_cluster(all_embeddings, label, block_num, pred_label)[source]
SocialED.detector.hcrc.evaluate_fun(all_embeddings, label, block_num, pred_label, result_path, task)[source]
class SocialED.detector.hcrc.SinglePass(sim_threshold, data, flag, label, size, agent, para, sim_init, sim=False, global_step=0)[source]

Bases: object

clustering(sen_vec)[source]
clustering_init(t, sen_vec)[source]
run_cluster_init(t, size)[source]
run_cluster_sim(flag, size, para, sim_init, sim, data)[source]
run_cluster(flag, size)[source]
get_center(label, data)[source]
get_info_cluster(text_vec, indexs_per_cluster)[source]
get_state(sim, sim_init, data)[source]
get_reward(sim_init, data)[source]
class SocialED.detector.hcrc.Node_ModelTrainer(args, block_num)[source]

Bases: embedder

get_embedding()[source]
train()[source]
class SocialED.detector.hcrc.NodeLevel(*args: Any, **kwargs: Any)[source]

Bases: Module

reset_moving_average()[source]
update_moving_average()[source]
forward(batch)[source]
class SocialED.detector.hcrc.Graph_ModelTrainer(args, block_num)[source]

Bases: embedder

get_embedding()[source]
train()[source]
class SocialED.detector.hcrc.GraphLevel(*args: Any, **kwargs: Any)[source]

Bases: Module

reset_moving_average()[source]
update_moving_average()[source]
forward(batch)[source]
SocialED.detector.hcrc.make_transition(trans, *items)[source]
SocialED.detector.hcrc.make_batch(state, action, old_log_prob, advantage, old_value, learn_size, batch_size, use_cuda)[source]
SocialED.detector.hcrc.calculate_nature_cnn_out_dim(height, weight)[source]
class SocialED.detector.hcrc.DQN_Config(input_type, input_size=None)[source]

Bases: object

class SocialED.detector.hcrc.DQN(state_dim, action_dim, input_type='vector', args=None)[source]

Bases: object

select_action(state, epsilon=None)[source]
add_buffer(transition)[source]
epsilon_decay()[source]
update_network()[source]
save_model(model_path)[source]
learn(step)[source]
class SocialED.detector.hcrc.PPO_Config(input_type, input_size=None)[source]

Bases: object

class SocialED.detector.hcrc.PPO(state_dim, action_dim, continuous=True, input_type='vector', args=None)[source]

Bases: object

select_action(state)[source]
add_buffer(transition)[source]
save_model(model_path)[source]
learn()[source]
class SocialED.detector.hcrc.ActorCritic(*args: Any, **kwargs: Any)[source]

Bases: Module

forward()[source]
act(state)[source]
evaluate_AC(state, action)[source]
class SocialED.detector.hcrc.MLPEncoder(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(x)[source]
get_dim()[source]
class SocialED.detector.hcrc.CNNEncoder(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(x)[source]
get_dim()[source]
class SocialED.detector.hcrc.QNet(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(state)[source]
class SocialED.detector.hcrc.BaseBuffer(trans, max_len)[source]

Bases: object

get_len()[source]
clear()[source]

Clear the buffer.

add(transition)[source]

Add a transition to the buffer.

get_data_buffer()[source]
sample(size)[source]
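A bounded transition buffer with the same method names can be sketched with collections.deque. This is an illustrative stand-in, not the library's BaseBuffer: the Transition fields, the eviction of the oldest entries once max_len is reached, and uniform sampling without replacement are all assumptions.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition layout; the library's `trans` argument may differ.
Transition = namedtuple("Transition", ["state", "action", "reward"])


class BufferSketch:
    """Fixed-capacity buffer that drops the oldest transition when full."""

    def __init__(self, max_len):
        self._data = deque(maxlen=max_len)

    def get_len(self):
        return len(self._data)

    def clear(self):
        self._data.clear()

    def add(self, transition):
        self._data.append(transition)

    def sample(self, size, rng=random):
        # Uniform sampling without replacement, capped at the buffer length.
        return rng.sample(list(self._data), min(size, len(self._data)))
```

For example, adding three transitions to a buffer created with max_len=2 leaves only the two most recent ones available for sampling.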
SocialED.detector.hcrc.unique(lists)[source]
SocialED.detector.hcrc.construct_graph_from_df(df, G=None)[source]
SocialED.detector.hcrc.construct_graph(data, feature, index)[source]
SocialED.detector.hcrc.normalize_adj(adj)[source]
SocialED.detector.hcrc.aug_edge(adj)[source]
SocialED.detector.hcrc.get_edge_index(adj)[source]
SocialED.detector.hcrc.get_data(message_num, start, tweet_sum, save_path)[source]
SocialED.detector.hcrc.getData(args, M_num)[source]
SocialED.detector.hcrc.save_data(data, save_path, M_num)[source]
SocialED.detector.hcrc.get_Graph_Dataset(args, message_number)[source]
SocialED.detector.hcrc.get_Node_Dataset(args, message_number)[source]
SocialED.detector.hcrc.documents_to_features(df)[source]
SocialED.detector.hcrc.extract_time_feature(t_str)[source]
SocialED.detector.hcrc.df_to_t_features(df)[source]

UCLSED

class SocialED.detector.uclsed.UCLSED(dataset, file_path='../model/model_saved/uclsed/', epoch=50, batch_size=128, neighbours_num=80, GNN_h_dim=256, GNN_out_dim=256, E_h_dim=128, use_uncertainty=True, use_cuda=True, gpuid=0, mode=0, mse=False, digamma=True, log=False, learning_rate=0.0001, weight_decay=1e-05)[source]

Bases: object

The UCLSED model for social event detection that uses uncertainty-aware contrastive learning for event detection.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide a load_data() method that returns the raw data and a get_dataset_language() method that returns the language of the dataset.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/uclsed/'.

  • epoch (int, optional) – Number of training epochs. Default: 50.

  • batch_size (int, optional) – Batch size for training. Default: 128.

  • neighbours_num (int, optional) – Number of neighbors to sample. Default: 80.

  • GNN_h_dim (int, optional) – Hidden dimension of GNN. Default: 256.

  • GNN_out_dim (int, optional) – Output dimension of GNN. Default: 256.

  • E_h_dim (int, optional) – Hidden dimension of encoder. Default: 128.

  • use_uncertainty (bool, optional) – Whether to use uncertainty estimation. Default: True.

  • use_cuda (bool, optional) – Whether to use GPU acceleration. Default: True.

  • gpuid (int, optional) – GPU device ID to use. Default: 0.

  • mode (int, optional) – Training mode. Default: 0.

  • mse (bool, optional) – Whether to use MSE loss. Default: False.

  • digamma (bool, optional) – Whether to use digamma function. Default: True.

  • log (bool, optional) – Whether to use log transformation. Default: False.

  • learning_rate (float, optional) – Learning rate for optimizer. Default: 1e-4.

  • weight_decay (float, optional) – Weight decay for optimizer. Default: 1e-5.

preprocess()[source]
fit()[source]
detection()[source]
evaluate(ground_truth, predictions)[source]
class SocialED.detector.uclsed.Preprocessor(args)[source]

Bases: object

str2list(str_ele)[source]
load_data(dataset)[source]
get_nlp(lang)[source]
construct_graph_base_eles(view_dict, df, path, lang)[source]
construct_graph(dataset, lang)[source]
SocialED.detector.uclsed.extract_results(g_dict, views, labels, model, args, train_indices=None)[source]
SocialED.detector.uclsed.train_model(model, g_dict, views, features, times, labels, epoch, criterion, mask_path, save_path, args)[source]
class SocialED.detector.uclsed.Tem_Agg_Layer(*args: Any, **kwargs: Any)[source]

Bases: Module

reset_parameters()[source]
edge_attention(edges)[source]
message_func(edges)[source]
reduce_func(nodes)[source]
forward(blocks, layer_id)[source]
class SocialED.detector.uclsed.GNN(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(blocks)[source]
edge_attention(edges)[source]
calculate_attention(edges)[source]
class SocialED.detector.uclsed.EDNN(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(x)[source]
class SocialED.detector.uclsed.UCLSED_model(*args: Any, **kwargs: Any)[source]

Bases: Module

forward(blocks_dict, is_EDNN_input=False, i=None, emb_v=None)[source]
SocialED.detector.uclsed.common_loss(emb1, emb2)[source]
SocialED.detector.uclsed.EUC_loss(alpha, u, true_labels, e)[source]
SocialED.detector.uclsed.kl_divergence(alpha, num_classes, device)[source]
SocialED.detector.uclsed.kl_pred_divergence(alpha, y, num_classes, device)[source]
SocialED.detector.uclsed.loglikelihood_loss(y, alpha, device)[source]
SocialED.detector.uclsed.mse_loss(y, alpha, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.uclsed.edl_loss(func, y, true_labels, alpha, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.uclsed.edl_mse_loss(alpha, target, true_labels, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.uclsed.edl_log_loss(alpha, target, true_labels, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.uclsed.edl_digamma_loss(alpha, target, true_labels, epoch_num, num_classes, annealing_step, device)[source]
SocialED.detector.uclsed.make_onehot(input, classes)[source]
SocialED.detector.uclsed.relu_evidence(y)[source]
SocialED.detector.uclsed.exp_evidence(y)[source]
SocialED.detector.uclsed.softplus_evidence(y)[source]
SocialED.detector.uclsed.DS_Combin(alpha, classes)[source]
Parameters:

alpha – All Dirichlet distribution parameters.

Returns:

Combined Dirichlet distribution parameters.
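
The source does not say how the Dirichlet parameters are combined. As a minimal sketch, assuming DS_Combin follows the reduced Dempster-Shafer rule commonly used in evidential multi-view learning, two views can be fused as below and further views folded in one at a time. The helper names combine_two_opinions and combine_all are hypothetical and are not part of SocialED.

```python
from functools import reduce


def combine_two_opinions(alpha1, alpha2):
    """Fuse two Dirichlet parameter vectors (all entries >= 1) for one sample."""
    k = len(alpha1)                           # number of classes
    s1, s2 = sum(alpha1), sum(alpha2)         # Dirichlet strengths
    b1 = [(a - 1) / s1 for a in alpha1]       # belief masses, view 1
    b2 = [(a - 1) / s2 for a in alpha2]       # belief masses, view 2
    u1, u2 = k / s1, k / s2                   # uncertainty masses
    # Conflict: belief the two views assign to *different* classes.
    conflict = sum(b1) * sum(b2) - sum(x * y for x, y in zip(b1, b2))
    scale = 1.0 - conflict
    b = [(x * y + x * u2 + y * u1) / scale for x, y in zip(b1, b2)]
    u = u1 * u2 / scale
    # Map the fused opinion back to Dirichlet parameters.
    s = k / u
    alpha = [bi * s + 1.0 for bi in b]
    return alpha, u


def combine_all(alphas):
    """Fold the pairwise rule over a list of per-view alpha vectors."""
    return reduce(lambda acc, nxt: combine_two_opinions(acc, nxt)[0], alphas)


fused_alpha, fused_u = combine_two_opinions([3.0, 1.0, 1.0], [2.0, 2.0, 1.0])
```

In this sketch the fused uncertainty is lower than that of either input view, because the two views largely agree on the first class.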

SocialED.detector.uclsed.graph_statistics(G, save_path)[source]
SocialED.detector.uclsed.get_dgl_data(args, views, language)[source]
SocialED.detector.uclsed.split_data(length, train_p, val_p, test_p)[source]
SocialED.detector.uclsed.ava_split_data(length, labels, classes)[source]

RPLMSED

class SocialED.detector.rplmsed.RPLMSED(dataset, plm_path='../model/model_needed/base_plm_model/roberta-large', file_path='../model/model_saved/rplmsed/', plm_tuning=False, use_ctx_att=False, offline=True, ctx_att_head_num=2, pmt_feats=(0, 1, 2, 4), batch_size=128, lmda1=0.01, lmda2=0.005, tao=0.9, optimizer='Adam', lr=2e-05, weight_decay=1e-05, momentum=0.9, step_lr_gamma=0.98, max_epochs=1, ckpt_path='../model/model_saved/rplmsed/ckpt/', eva_data='../model/model_saved/rplmsed/Eva_data/', early_stop_patience=2, early_stop_monitor='loss', SAMPLE_NUM_TWEET=60, WINDOW_SIZE=3, device='cpu')[source]

Bases: object

The RPLMSED model for social event detection that uses pre-trained language models with prompt learning for event detection.

Note

This detector uses prompt learning with pre-trained language models to identify events in social media data. The model requires a dataset object with a load_data() method.

Parameters:
  • dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.

  • plm_path (str, optional) – Path to pre-trained language model. Default: '../model/model_needed/base_plm_model/roberta-large'.

  • file_path (str, optional) – Path to save model files. Default: '../model/model_saved/rplmsed/'.

  • plm_tuning (bool, optional) – Whether to fine-tune PLM. Default: False.

  • use_ctx_att (bool, optional) – Whether to use context attention. Default: False.

  • offline (bool, optional) – Whether to use offline mode. Default: True.

  • ctx_att_head_num (int, optional) – Number of context attention heads. Default: 2.

  • pmt_feats (tuple, optional) – Prompt feature indices to use. Default: (0,1,2,4).

  • batch_size (int, optional) – Batch size for training. Default: 128.

  • lmda1 (float, optional) – Lambda 1 hyperparameter. Default: 0.010.

  • lmda2 (float, optional) – Lambda 2 hyperparameter. Default: 0.005.

  • tao (float, optional) – Temperature parameter. Default: 0.90.

  • optimizer (str, optional) – Optimizer to use. Default: 'Adam'.

  • lr (float, optional) – Learning rate. Default: 2e-5.

  • weight_decay (float, optional) – Weight decay for optimizer. Default: 1e-5.

  • momentum (float, optional) – Momentum for optimizer. Default: 0.9.

  • step_lr_gamma (float, optional) – Learning rate decay factor. Default: 0.98.

  • max_epochs (int, optional) – Maximum training epochs. Default: 1.

  • ckpt_path (str, optional) – Path to save checkpoints. Default: '../model/model_saved/rplmsed/ckpt/'.

  • eva_data (str, optional) – Path to evaluation data. Default: '../model/model_saved/rplmsed/Eva_data/'.

  • early_stop_patience (int, optional) – Early stopping patience. Default: 2.

  • early_stop_monitor (str, optional) – Metric to monitor for early stopping. Default: 'loss'.

  • SAMPLE_NUM_TWEET (int, optional) – Number of tweets to sample. Default: 60.

  • WINDOW_SIZE (int, optional) – Size of sliding window. Default: 3.

  • device (str, optional) – Device to use for computation. Default: "cuda:0" if available else "cpu".

preprocess()[source]
fit()[source]
detection()[source]
evaluate(predictions, ground_truths)[source]
class SocialED.detector.rplmsed.DataItem(tweet_id, text, event_id, words, filtered_words, entities, user_id, created_at, urls, hashtags, user_mentions)

Bases: tuple

created_at

Alias for field number 7

entities

Alias for field number 5

event_id

Alias for field number 2

filtered_words

Alias for field number 4

hashtags

Alias for field number 9

text

Alias for field number 1

tweet_id

Alias for field number 0

urls

Alias for field number 8

user_id

Alias for field number 6

user_mentions

Alias for field number 10

words

Alias for field number 3
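
"Bases: tuple" together with the "Alias for field number" entries suggests that DataItem is a named tuple. The sketch below rebuilds an equivalent type under that assumption to show how field names map to positions. DataItemSketch and the sample values are illustrative only.

```python
from collections import namedtuple

# Field order copied from the DataItem signature above.
DataItemSketch = namedtuple(
    "DataItemSketch",
    [
        "tweet_id", "text", "event_id", "words", "filtered_words", "entities",
        "user_id", "created_at", "urls", "hashtags", "user_mentions",
    ],
)

item = DataItemSketch(
    tweet_id=1,
    text="flood downtown",
    event_id=7,
    words=["flood", "downtown"],
    filtered_words=["flood"],
    entities=[],
    user_id=42,
    created_at="2012-10-10 00:00:00",
    urls=[],
    hashtags=[],
    user_mentions=[],
)

# Attribute access and positional access refer to the same slot.
same_slot = item.created_at == item[7]
```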

class SocialED.detector.rplmsed.Preprocessor[source]

Bases: object

preprocess_all(dataset)[source]
to_sparse_matrix(feat_to_tw, tw_num, tao=0)[source]
build_entity_adj(data)[source]
build_hashtag_adj(data)[source]
build_words_adj(data)[source]
build_user_adj(data)[source]
build_creat_at_adj(data)[source]
tweet_to_event(data)[source]
build_feats_adj(data, feats)[source]
build_feat_adj(data, cols)[source]
get_time_relation(tw_i, tw_j, delta: timedelta = datetime.timedelta(seconds=14400))[source]
make_train_samples(tw_adj, tw_to_ev, data)[source]
make_ref_samples(tw_adj, tw_to_ev, data)[source]
process_block(block)[source]
split_train_test_validation(data: List)[source]
split_into_blocks(data)[source]
pre_process(data)[source]
SocialED.detector.rplmsed.get_model(args)[source]
SocialED.detector.rplmsed.initialize(model, args, num_train_batch)[source]
SocialED.detector.rplmsed.batch_to_tensor(batch, args)[source]
SocialED.detector.rplmsed.create_trainer(model, optimizer, lr_scheduler, args)[source]
SocialED.detector.rplmsed.create_evaluator(model, args)[source]
SocialED.detector.rplmsed.create_tester(model, args, msg_feats, ref_num)[source]
SocialED.detector.rplmsed.test_on_block(model, cfg, blk, b=0)[source]
SocialED.detector.rplmsed.load_ckpt(model, args, ckpt, b)[source]
SocialED.detector.rplmsed.start_run(cfg, blocks)[source]
SocialED.detector.rplmsed.train_on_block(model, args, blk, blk_id=0)[source]
SocialED.detector.rplmsed.load_data_blocks(path_to_data, args, tokenizer)[source]
class SocialED.detector.rplmsed.CkptWrapper(state: Any)[source]

Bases: object

state_dict()[source]
SocialED.detector.rplmsed.get_model_state(model, params, plm_tuning)[source]
SocialED.detector.rplmsed.width(text)[source]
SocialED.detector.rplmsed.print_table(tab)[source]
SocialED.detector.rplmsed.data_generator(data, batch_size, shuffle=False, repeat=False)[source]
SocialED.detector.rplmsed.create_data_generator(data, batch_size, shuffle, repeat, batch_num)[source]
SocialED.detector.rplmsed.pad_seq(seq, max_len, pad=0, pad_left=False)[source]

Pad or truncate a sequence to a fixed length.

Parameters:
  • seq – input sequence

  • max_len – max length

  • pad – padding token id

  • pad_left – pad on left

Returns:

padded sequence
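
A minimal sketch of the behaviour described for pad_seq, assuming that truncation keeps the first max_len items (the source does not say which end is dropped). The name pad_seq_sketch is hypothetical.

```python
def pad_seq_sketch(seq, max_len, pad=0, pad_left=False):
    """Return a list of exactly max_len items."""
    seq = list(seq)[:max_len]                  # assumption: keep the head when too long
    padding = [pad] * (max_len - len(seq))     # empty when already long enough
    return padding + seq if pad_left else seq + padding


right_padded = pad_seq_sketch([1, 2, 3], 5)
left_padded = pad_seq_sketch([1, 2, 3], 5, pad_left=True)
truncated = pad_seq_sketch([1, 2, 3, 4], 2)
```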

SocialED.detector.rplmsed.run_kmeans(msg_feats, n_clust, msg_tags)[source]
SocialED.detector.rplmsed.run_hdbscan(msg_feats, msg_tags)[source]
SocialED.detector.rplmsed.run_dbscan(msg_feats, msg_tags)[source]
SocialED.detector.rplmsed.print_scores(scores)[source]
SocialED.detector.rplmsed.encode_samples(samples, raw_data, tokenizer, pmt_idx)[source]
SocialED.detector.rplmsed.count_condition(data, key, threshold)[source]
SocialED.detector.rplmsed.calculate_average_min_score(newscore, min_score, max_score)[source]
class SocialED.detector.rplmsed.StructAttention(*args: Any, **kwargs: Any)[source]

Bases: Module

An implementation of the paper "A Structured Self-Attentive Sentence Embedding".

__init__(feat_dim, hid_dim, att_head_num=1)[source]

Initializes parameters suggested in the paper.

Parameters:
  • feat_dim (int) – hidden dimension for lstm

  • hid_dim (int) – hidden dimension for the dense layer

  • att_head_num (int) – attention-hops or attention heads

Returns:

self

Raises:

Exception

forward(inpt, mask=None)[source]
Parameters:
  • inpt – [len, bsz, dim]

  • mask – [len, bsz]

Returns:

[bsz, head_num, dim], [bsz, head_num, len]
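
To make the documented shapes concrete, here is a NumPy sketch of structured self-attention in the style of the cited paper: scores are softmax(W2 tanh(W1 Hᵀ)) taken over the sequence axis. The weight names w1 and w2, the function name, and the mask convention (1 = real token, 0 = padding) are assumptions; the actual module is a torch Module whose internals are not shown in this reference.

```python
import numpy as np


def struct_attention_sketch(inpt, w1, w2, mask=None):
    """inpt: [len, bsz, dim]; w1: [hid_dim, dim]; w2: [head_num, hid_dim]."""
    h = np.transpose(inpt, (1, 0, 2))                  # [bsz, len, dim]
    scores = np.tanh(h @ w1.T) @ w2.T                  # [bsz, len, head_num]
    scores = np.transpose(scores, (0, 2, 1))           # [bsz, head_num, len]
    if mask is not None:
        keep = np.transpose(mask, (1, 0))[:, None, :].astype(bool)  # [bsz, 1, len]
        scores = np.where(keep, scores, -1e9)          # padded steps get ~0 weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    att = np.exp(scores)
    att = att / att.sum(axis=-1, keepdims=True)        # softmax over len
    return att @ h, att                                # [bsz, head_num, dim], [bsz, head_num, len]


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2, 4))                         # len=5, bsz=2, dim=4
w1 = rng.normal(size=(3, 4))                           # hid_dim=3
w2 = rng.normal(size=(2, 3))                           # head_num=2
mask = np.ones((5, 2))
mask[3:, 0] = 0                                        # last two steps of sample 0 are padding
out, att = struct_attention_sketch(x, w1, w2, mask)
```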

class SocialED.detector.rplmsed.PairPfxTuningEncoder(*args: Any, **kwargs: Any)[source]

Bases: Module

feat_size()[source]
reload_plm(device)[source]
accumulate_reload_plm(device, accumulate_rate=0.4)[source]
fix_plm()[source]
forward(inputs, types, prompt, mask)[source]

HISEvent

class SocialED.detector.hisevent.HISEvent(dataset)[source]

Bases: object

HISEvent class for event detection.

This class implements hierarchical structure-based event detection.

Parameters:
  • dataset – Input dataset

  • ...

preprocess()[source]
detection()[source]
evaluate(ground_truths, predictions)[source]

Evaluate the model.

class SocialED.detector.hisevent.Preprocessor(dataset, mode='close')[source]

Bases: object

__init__(dataset, mode='close')[source]

Initialize preprocessor.

Parameters:
  • dataset – Dataset class (e.g. Event2012, Event2018, etc.)

  • language – Language of the dataset (default ‘English’)

  • mode – ‘open’ or ‘close’ (default ‘close’); determines preprocessing mode

get_closed_set_test_df(df)[source]

Get closed set test dataframe

get_closed_set_messages_embeddings()[source]

Get SBERT embeddings for closed set messages

get_open_set_messages_embeddings()[source]

Get SBERT embeddings for open set messages

split_open_set(df, root_path)[source]

Split data into open set blocks

preprocess()[source]

Main preprocessing function

split_and_save_masks(df, save_dir, train_size=0.7, val_size=0.1, test_size=0.2, random_seed=42)[source]

Splits the DataFrame into training, validation, and test sets, and saves the indices (masks) as .pt files.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be split

  • save_dir (str) – Directory to save the masks

  • train_size (float) – Proportion for training (default 0.7)

  • val_size (float) – Proportion for validation (default 0.1)

  • test_size (float) – Proportion for testing (default 0.2)

  • random_seed (int) – Random seed for reproducibility
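
The split described above can be sketched on plain row indices. This is an illustration, not the library's code: it uses the standard-library random module and skips writing the .pt files, and the name split_indices_sketch is hypothetical.

```python
import random


def split_indices_sketch(n_rows, train_size=0.7, val_size=0.1, test_size=0.2, random_seed=42):
    """Shuffle row indices with a fixed seed and cut them into three index lists."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size + val_size + test_size must equal 1")
    indices = list(range(n_rows))
    random.Random(random_seed).shuffle(indices)        # seeded, so the split is reproducible
    n_train = round(n_rows * train_size)
    n_val = round(n_rows * val_size)
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],             # remainder goes to the test set
    }


masks = split_indices_sketch(100)
```

Each list could then be converted to a tensor and saved, which is the step the real method performs when it writes the masks to save_dir.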

SocialED.detector.hisevent.get_stable_point(path)[source]
SocialED.detector.hisevent.run_hier_2D_SE_mini_data(save_path, n=300, e_a=True, e_s=True)[source]
SocialED.detector.hisevent.search_stable_points(embeddings, max_num_neighbors=50)[source]
SocialED.detector.hisevent.get_graph_edges(attributes)[source]
SocialED.detector.hisevent.get_knn_edges(embeddings, default_num_neighbors)[source]
SocialED.detector.hisevent.get_global_edges(attributes, embeddings, default_num_neighbors, e_a=True, e_s=True)[source]
SocialED.detector.hisevent.get_subgraphs_edges(clusters, graph_splits, weighted_global_edges)[source]

Get subgraph edges.

Parameters:
  • clusters – a list containing the current clusters, each cluster is a list of nodes of the original graph

  • graph_splits – a list of (start_index, end_index) pairs, each (start_index, end_index) pair indicates a subset of clusters, which will serve as the nodes of a new subgraph

  • weighted_global_edges – a list of (start node, end node, edge weight) tuples, each tuple is an edge in the original graph

Returns:

a list containing the edges of all subgraphs

Return type:

all_subgraphs_edges
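
Read literally, the description above amounts to filtering the global edge list once per split. A minimal sketch, assuming that end_index is exclusive and that an edge belongs to a subgraph only when both endpoints fall inside it (neither detail is stated in the source); subgraph_edges_sketch is a hypothetical name.

```python
def subgraph_edges_sketch(clusters, graph_splits, weighted_global_edges):
    """Return one edge list per (start_index, end_index) split."""
    all_subgraphs_edges = []
    for start, end in graph_splits:
        # Nodes of this subgraph: every original node in clusters[start:end].
        nodes = {node for cluster in clusters[start:end] for node in cluster}
        edges = [
            (u, v, w)
            for u, v, w in weighted_global_edges
            if u in nodes and v in nodes
        ]
        all_subgraphs_edges.append(edges)
    return all_subgraphs_edges


clusters = [[0, 1], [2], [3, 4]]
graph_splits = [(0, 2), (2, 3)]
weighted_global_edges = [(0, 1, 0.5), (1, 2, 0.3), (2, 3, 0.9), (3, 4, 0.7)]
result = subgraph_edges_sketch(clusters, graph_splits, weighted_global_edges)
```

The edge (2, 3, 0.9) crosses the two splits, so under these assumptions it appears in neither subgraph.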

SocialED.detector.hisevent.hier_2D_SE_mini(weighted_global_edges, n_messages, n=100)[source]

hierarchical 2D SE minimization

class SocialED.detector.hisevent.SE(graph: networkx.Graph)[source]

Bases: object

get_vol()[source]

get the volume of the graph

calc_1dSE()[source]

get the 1D SE of the graph

update_1dSE(original_1dSE, new_edges)[source]

get the updated 1D SE after new edges are inserted into the graph

get_cut(comm)[source]

get the sum of the degrees of the cut edges of community comm

get_volume(comm)[source]

get the volume of community comm

calc_2dSE()[source]

get the 2D SE of the graph
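
For readers unfamiliar with structural entropy (SE), the standard definitions can be sketched on a weighted edge list. This assumes the class follows the usual formulas, where the 1D SE is the Shannon entropy of the degree distribution and the 2D SE adds a per-community term based on cut and volume. The function names are hypothetical and the code is not taken from the SE class.

```python
import math
from collections import defaultdict


def weighted_degrees(edges):
    """Map each node to the sum of the weights of its incident edges."""
    deg = defaultdict(float)
    for u, v, w in edges:
        deg[u] += w
        deg[v] += w
    return deg


def one_dim_se(edges):
    deg = weighted_degrees(edges)
    vol = sum(deg.values())                            # volume of the whole graph
    return -sum((d / vol) * math.log2(d / vol) for d in deg.values() if d > 0)


def two_dim_se(edges, communities):
    deg = weighted_degrees(edges)
    vol = sum(deg.values())
    total = 0.0
    for comm in communities:
        members = set(comm)
        volume = sum(deg[n] for n in members)          # volume of the community
        if volume == 0:
            continue
        # Cut: total weight of edges with exactly one endpoint in the community.
        cut = sum(w for u, v, w in edges if (u in members) != (v in members))
        total -= (cut / vol) * math.log2(volume / vol)
        total -= sum((deg[n] / vol) * math.log2(deg[n] / volume) for n in members if deg[n] > 0)
    return total


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
se_1d = one_dim_se(edges)
se_singletons = two_dim_se(edges, [[n] for n in range(6)])
se_triangles = two_dim_se(edges, [[0, 1, 2], [3, 4, 5]])
```

With every node in its own community the 2D SE equals the 1D SE, and grouping the two triangles lowers it, which is the quantity the minimization routines in this module try to reduce.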

show_division()[source]
show_struc_data()[source]
show_struc_data_2d()[source]
print_graph()[source]
update_struc_data()[source]

calculate the volume, cut, community mode SE, and leaf nodes SE of each community, then store them into self.struc_data

update_struc_data_2d()[source]

calculate the volume, cut, community mode SE, and leaf nodes SE after merging each pair of communities, then store them into self.struc_data_2d

init_division()[source]

initialize self.division such that each node is assigned to its own community

add_isolates()[source]

add any isolated nodes into graph

update_division_MinSE()[source]

greedily update the encoding tree to minimize 2D SE

SocialED.detector.hisevent.vanilla_2D_SE_mini(weighted_edges)[source]

vanilla (greedy) 2D SE minimization
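
The reference gives no detail on the greedy procedure. One plausible reading, sketched here under that assumption, starts with every node in its own community and repeatedly applies the pairwise merge that lowers the 2D SE the most, stopping when no merge helps. The sketch recomputes the entropy from scratch for every candidate, which is simple but far slower than an incremental implementation; all names are hypothetical.

```python
import math
from itertools import combinations


def two_dim_se(edges, division):
    """2D structural entropy of a partition given as an iterable of node sets."""
    deg = {}
    for u, v, w in edges:
        deg[u] = deg.get(u, 0.0) + w
        deg[v] = deg.get(v, 0.0) + w
    vol = sum(deg.values())
    se = 0.0
    for comm in division:
        volume = sum(deg[n] for n in comm)
        cut = sum(w for u, v, w in edges if (u in comm) != (v in comm))
        se -= (cut / vol) * math.log2(volume / vol)
        se -= sum((deg[n] / vol) * math.log2(deg[n] / volume) for n in comm)
    return se


def greedy_2d_se_mini_sketch(edges):
    """Greedy merging; assumes positive edge weights and no isolated nodes."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    division = [frozenset([n]) for n in nodes]         # one community per node
    best = two_dim_se(edges, division)
    while len(division) > 1:
        candidates = []
        for a, b in combinations(division, 2):
            merged = [c for c in division if c not in (a, b)] + [a | b]
            candidates.append((two_dim_se(edges, merged), merged))
        se, merged = min(candidates, key=lambda item: item[0])
        if se >= best - 1e-12:                         # no merge improves the entropy
            break
        best, division = se, merged
    return division, best


edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
start_se = two_dim_se(edges, [frozenset([n]) for n in range(6)])
division, final_se = greedy_2d_se_mini_sketch(edges)
```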

SocialED.detector.hisevent.test_vanilla_2D_SE_mini()[source]
SocialED.detector.hisevent.replaceAtUser(text)[source]

Replaces “@user” with “”

SocialED.detector.hisevent.removeUnicode(text)[source]

Removes unicode strings like “\u002c” and “x96”

SocialED.detector.hisevent.replaceURL(text)[source]

Replaces url address with “url”

SocialED.detector.hisevent.replaceMultiExclamationMark(text)[source]

Replaces repetitions of exclamation marks

SocialED.detector.hisevent.replaceMultiQuestionMark(text)[source]

Replaces repetitions of question marks

SocialED.detector.hisevent.removeEmoticons(text)[source]

Removes emoticons from text
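
The cleaning helpers above are described only by one-line summaries. The regular expressions below are a sketch of what such helpers typically look like, not the library's actual patterns. The function names and the token argument (the text that replaces a run of exclamation marks) are assumptions.

```python
import re


def replace_at_user_sketch(text):
    """Drop @mentions such as "@bob"."""
    return re.sub(r"@\w+", "", text)


def replace_url_sketch(text):
    """Replace http(s) and www addresses with the literal word "url"."""
    return re.sub(r"(https?://\S+|www\.\S+)", "url", text)


def collapse_exclamations_sketch(text, token="!"):
    """Replace runs of two or more exclamation marks with a single token."""
    return re.sub(r"!{2,}", token, text)


cleaned = collapse_exclamations_sketch(
    replace_url_sketch(replace_at_user_sketch("@bob look https://example.com wow!!!"))
)
```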

SocialED.detector.hisevent.removeNewLines(text)[source]
SocialED.detector.hisevent.preprocess_sentence(s)[source]
SocialED.detector.hisevent.preprocess_french_sentence(s)[source]
SocialED.detector.hisevent.SBERT_embed(s_list, language)[source]

Use Sentence-BERT to embed sentences.

Parameters:
  • s_list – a list of sentences/tokens to be embedded

  • language – the language of the sentences (‘English’, ‘French’, ‘Arabic’)

Returns:

the embeddings of the sentences/tokens

SocialED.detector.hisevent.evaluate(labels_true, labels_pred)[source]
SocialED.detector.hisevent.decode(division)[source]

ADPSEMEvent

class SocialED.detector.adpsemevent.ADPSEMEvent(dataset)[source]

Bases: object

ADPSEMEvent class for event detection.

This class implements adaptive semantic event detection.

Parameters:
  • dataset – Input dataset

  • ...

preprocess()[source]
detection()[source]
evaluate(ground_truths, predictions)[source]

Evaluate the model.

class SocialED.detector.adpsemevent.Preprocessor(dataset, mode='close')[source]

Bases: object

__init__(dataset, mode='close')[source]

Initialize preprocessor.

Parameters:
  • dataset – Dataset class (e.g. Event2012, Event2018, etc.)

  • language – Language of the dataset (default ‘English’)

  • mode – ‘open’ or ‘close’ (default ‘close’); determines preprocessing mode

get_closed_set_test_df(df)[source]

Get closed set test dataframe

get_closed_set_messages_embeddings()[source]

Get SBERT embeddings for closed set messages

get_open_set_messages_embeddings()[source]

Get SBERT embeddings for open set messages

split_open_set(df, root_path)[source]

Split data into open set blocks

preprocess()[source]

Main preprocessing function

split_and_save_masks(df, save_dir, train_size=0.7, val_size=0.1, test_size=0.2, random_seed=42)[source]

Splits the DataFrame into training, validation, and test sets, and saves the indices (masks) as .pt files.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be split

  • save_dir (str) – Directory to save the masks

  • train_size (float) – Proportion for training (default 0.7)

  • val_size (float) – Proportion for validation (default 0.1)

  • test_size (float) – Proportion for testing (default 0.2)

  • random_seed (int) – Random seed for reproducibility

SocialED.detector.adpsemevent.get_stable_point(path, if_updata, epsilon)[source]
SocialED.detector.adpsemevent.run_hier_2D_SE_mini_open_set(save_path, n=400, e_a=True, e_s=True, test_with_one_block=True, epsilon=0.2)[source]
SocialED.detector.adpsemevent.run_hier_2D_SE_mini_closed_set(save_path, n=300, e_a=True, e_s=True, epsilon=None)[source]
SocialED.detector.adpsemevent.create_process_open_set(epsilon)[source]
SocialED.detector.adpsemevent.create_process_closed_set(epsilon)[source]
SocialED.detector.adpsemevent.run_processes(epsilons, dataset_name, mode='close')[source]
SocialED.detector.adpsemevent.make_symmetric(matrix)[source]
SocialED.detector.adpsemevent.search_stable_points(embeddings, epsilon, path, max_num_neighbors=200)[source]
SocialED.detector.adpsemevent.get_graph_edges(attributes)[source]
SocialED.detector.adpsemevent.get_knn_edges(epsilon, path, default_num_neighbors)[source]
SocialED.detector.adpsemevent.get_global_edges(attributes, epsilon, folder, default_num_neighbors, e_a=True, e_s=True)[source]
SocialED.detector.adpsemevent.get_subgraphs_edges(clusters, graph_splits, weighted_global_edges)[source]

Get subgraph edges.

Parameters:
  • clusters – a list containing the current clusters, each cluster is a list of nodes of the original graph

  • graph_splits – a list of (start_index, end_index) pairs, each (start_index, end_index) pair indicates a subset of clusters, which will serve as the nodes of a new subgraph

  • weighted_global_edges – a list of (start node, end node, edge weight) tuples, each tuple is an edge in the original graph

Returns:

a list containing the edges of all subgraphs

Return type:

all_subgraphs_edges

SocialED.detector.adpsemevent.get_best_egde(adj_matrix_, subgraphs_, all_subgraphs)[source]
SocialED.detector.adpsemevent.get_best_node(adj_matrix_, subgraphs_, all_subgraphs)[source]
SocialED.detector.adpsemevent.get_subgraphs(adj_matrix, division, n, k_max)[source]
SocialED.detector.adpsemevent.hier_2D_SE_mini(weighted_global_edges, n_messages, n=100)[source]

hierarchical 2D SE minimization

class SocialED.detector.adpsemevent.SE(graph: networkx.Graph)[source]

Bases: object

get_vol()[source]

get the volume of the graph

calc_1dSE()[source]

get the 1D SE of the graph

update_1dSE(original_1dSE, new_edges)[source]

get the updated 1D SE after new edges are inserted into the graph

get_cut(comm)[source]

get the sum of the degrees of the cut edges of community comm

get_volume(comm)[source]

get the volume of community comm

calc_2dSE()[source]

get the 2D SE of the graph

show_division()[source]
show_struc_data()[source]
show_struc_data_2d()[source]
print_graph()[source]
update_struc_data()[source]

calculate the volume, cut, community mode SE, and leaf nodes SE of each community, then store them into self.struc_data

update_struc_data_2d()[source]

calculate the volume, cut, community mode SE, and leaf nodes SE after merging each pair of communities, then store them into self.struc_data_2d

init_division()[source]

initialize self.division such that each node is assigned to its own community

add_isolates()[source]

add any isolated nodes into graph

update_division_MinSE()[source]

greedily update the encoding tree to minimize 2D SE

SocialED.detector.adpsemevent.vanilla_2D_SE_mini(weighted_edges)[source]

vanilla (greedy) 2D SE minimization

SocialED.detector.adpsemevent.test_vanilla_2D_SE_mini()[source]
SocialED.detector.adpsemevent.replaceAtUser(text)[source]

Replaces “@user” with “”

SocialED.detector.adpsemevent.removeUnicode(text)[source]

Removes unicode strings like “\u002c” and “x96”

SocialED.detector.adpsemevent.replaceURL(text)[source]

Replaces url address with “url”

SocialED.detector.adpsemevent.replaceMultiExclamationMark(text)[source]

Replaces repetitions of exclamation marks

SocialED.detector.adpsemevent.replaceMultiQuestionMark(text)[source]

Replaces repetitions of question marks

SocialED.detector.adpsemevent.removeEmoticons(text)[source]

Removes emoticons from text

SocialED.detector.adpsemevent.removeNewLines(text)[source]
SocialED.detector.adpsemevent.preprocess_sentence(s)[source]
SocialED.detector.adpsemevent.preprocess_french_sentence(s)[source]
SocialED.detector.adpsemevent.SBERT_embed(s_list, language)[source]

Use Sentence-BERT to embed sentences.

Parameters:
  • s_list – a list of sentences/tokens to be embedded

  • language – the language of the sentences (‘English’, ‘French’, ‘Arabic’)

Returns:

the embeddings of the sentences/tokens

SocialED.detector.adpsemevent.evaluate_labels(labels_true, labels_pred)[source]
SocialED.detector.adpsemevent.decode(division)[source]

Hypersed