dataset (object) – The dataset object containing social media data.
Must provide load_data() method that returns the raw data.
n_epochs (int, optional) – Number of training epochs. Default: 1.
n_infer_epochs (int, optional) – Number of inference epochs. Default: 0.
window_size (int, optional) – Size of sliding window for incremental learning. Default: 3.
patience (int, optional) – Number of epochs to wait before early stopping. Default: 5.
margin (float, optional) – Margin for triplet loss. Default: 3.0.
lr (float, optional) – Learning rate. Default: 1e-3.
batch_size (int, optional) – Mini-batch size. Default: 2000.
n_neighbors (int, optional) – Number of neighbors for graph construction. Default: 800.
word_embedding_dim (int, optional) – Dimension of word embeddings. Default: 300.
hidden_dim (int, optional) – Hidden layer dimension. Default: 8.
out_dim (int, optional) – Output dimension. Default: 32.
num_heads (int, optional) – Number of attention heads. Default: 4.
use_residual (bool, optional) – Whether to use residual connections. Default: True.
validation_percent (float, optional) – Percentage of data for validation. Default: 0.1.
test_percent (float, optional) – Percentage of data for testing. Default: 0.2.
use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: False.
metrics (str, optional) – Evaluation metric to use. Default: 'ami'.
use_cuda (bool, optional) – Whether to use GPU acceleration. Default: False.
gpuid (int, optional) – ID of GPU to use. Default: 0.
mask_path (str, optional) – Path to attention mask file. Default: None.
log_interval (int, optional) – Number of steps between logging. Default: 10.
is_incremental (bool, optional) – Whether to use incremental learning. Default: False.
mutual (bool, optional) – Whether to use mutual learning. Default: False.
mode (int, optional) – Training mode. Default: 0.
add_mapping (bool, optional) – Whether to add mapping layer. Default: False.
data_path (str, optional) – Path to data directory. Default: '../model/model_saved/clkd/English'.
file_path (str, optional) – Path to save files. Default: '../model/model_saved/clkd'.
Tmodel_path (str, optional) – Path to teacher model. Default: '../model/model_saved/clkd/English/Tmodel/'.
lang (str, optional) – Language of the data. Default: 'French'.
Tealang (str, optional) – Language of teacher model. Default: 'English'.
t (float, optional) – Temperature parameter. Default: 1.
data_path1 (str, optional) – Path to first language data. Default: '../model/model_saved/clkd/English'.
data_path2 (str, optional) – Path to second language data. Default: '../model/model_saved/clkd/French'.
lang1 (str, optional) – First language. Default: 'English'.
lang2 (str, optional) – Second language. Default: 'French'.
e (float, optional) – Epsilon parameter. Default: 0.
mt (float, optional) – Momentum parameter. Default: 0.5.
rd (float, optional) – Random drop rate. Default: 0.1.
is_static (bool, optional) – Whether to use static embeddings. Default: False.
graph_lang (str, optional) – Language for graph construction. Default: 'English'.
tgtlang (str, optional) – Target language. Default: 'French'.
days (int, optional) – Number of days for temporal window. Default: 7.
initial_lang (str, optional) – Initial language. Default: 'French'.
TransLinear (bool, optional) – Whether to use linear transformation. Default: True.
tgt (str, optional) – Target language code. Default: 'English'.
embpath (str, optional) – Path to embedding file. Default: '../model/model_saved/clkd/dictrans/fr-en-for.npy'.
wordpath (str, optional) – Path to word dictionary. Default: '../model/model_saved/clkd/dictrans/wordsFrench.txt'.
SocialED.detector
LDA
Bases: object
The LDA model for social event detection that uses Latent Dirichlet Allocation for topic modeling and event detection.
Note
This detector uses topic modeling to identify events in social media data. The model requires a dataset object with a load_data() method.
See [1] for details.
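The load_data() requirement can be met by any object that exposes that method. The sketch below is a hypothetical stand-in, not a SocialED class: the name ToyDataset and the record fields text and event_id are assumptions made for illustration, since the shape of the raw data is not specified here.

```python
class ToyDataset:
    """Hypothetical dataset wrapper; not part of SocialED."""

    def __init__(self, records):
        # Keep a private copy so callers cannot mutate the stored data.
        self._records = list(records)

    def load_data(self):
        # Return the raw records; a detector would call this once at start-up.
        return list(self._records)


# Field names below are illustrative assumptions.
dataset = ToyDataset([
    {"text": "earthquake hits the coast", "event_id": 0},
    {"text": "election results announced", "event_id": 1},
])
raw = dataset.load_data()
```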
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
num_topics (int, optional) – Number of topics to extract. Default: 50.
passes (int, optional) – Number of passes through corpus during training. Default: 20.
iterations (int, optional) – Maximum number of iterations through corpus. Default: 50.
alpha (str or float, optional) – Prior document-topic distribution. Default: 'symmetric'.
eta (float, optional) – Prior topic-word distribution. Default: None.
random_state (int, optional) – Random seed for reproducibility. Default: 1.
eval_every (int, optional) – Log perplexity evaluation frequency. Default: 10.
chunksize (int, optional) – Number of documents per training chunk. Default: 2000.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/LDA/'.
Data preprocessing: tokenization, stop words removal, etc.
Create corpus and dictionary required for LDA model.
Load the LDA model from a file.
Display topics generated by the LDA model.
Assign topics to each document and save unique ground truths and predictions to a CSV file.
Evaluate the model.
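Assigning a topic to each document can be sketched as picking the most probable topic per document. This is a minimal illustration, assuming each document's topic distribution is already available as a list of (topic_id, probability) pairs; the function name dominant_topic and the sample values are hypothetical.

```python
def dominant_topic(topic_distribution):
    """Return the id of the topic with the highest probability."""
    # Each entry is a (topic_id, probability) pair; compare on probability.
    return max(topic_distribution, key=lambda pair: pair[1])[0]


# Illustrative per-document topic distributions (made-up numbers).
doc_topics = [
    [(0, 0.10), (3, 0.85), (7, 0.05)],
    [(1, 0.60), (2, 0.40)],
]
predictions = [dominant_topic(dist) for dist in doc_topics]
```

The resulting predictions list can then be compared against the ground-truth event labels during evaluation.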
BiLSTM
Bases: object
The BiLSTM model for social event detection that uses bidirectional LSTM to detect events in social media data.
Note
This detector uses bidirectional LSTM to identify events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
lr (float, optional) – Learning rate for optimizer. Default: 1e-3.
batch_size (int, optional) – Batch size for training. Default: 1000.
dropout_keep_prob (float, optional) – Dropout keep probability. Default: 0.8.
embedding_size (int, optional) – Size of word embeddings. Default: 300.
max_size (int, optional) – Maximum vocabulary size. Default: 5000.
seed (int, optional) – Random seed for reproducibility. Default: 1.
num_hidden_nodes (int, optional) – Number of LSTM hidden nodes. Default: 32.
hidden_dim2 (int, optional) – Size of second hidden layer. Default: 64.
num_layers (int, optional) – Number of LSTM layers. Default: 1.
bi_directional (bool, optional) – Whether to use bidirectional LSTM. Default: True.
pad_index (int, optional) – Index used for padding. Default: 0.
num_epochs (int, optional) – Number of training epochs. Default: 20.
margin (int, optional) – Margin for triplet loss. Default: 3.
max_len (int, optional) – Maximum sequence length. Default: 10.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/Bilstm/'.
Data preprocessing: tokenization, stop words removal, etc.
Split the dataset into training, validation, and test sets.
Load pre-trained word embeddings.
Train the BiLSTM model.
Evaluate the model.
Run the training and evaluation process for the BiLSTM model.
Fit the model on the training data and save the best model.
Detect events using the best trained model on the test data.
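The max_len and pad_index parameters suggest that token sequences are brought to a fixed length before they reach the LSTM. A minimal sketch of that step follows; padding on the right and the helper name pad_or_truncate are assumptions, not confirmed details of the BiLSTM detector.

```python
def pad_or_truncate(token_ids, max_len=10, pad_index=0):
    """Clip a sequence to max_len, then right-pad it with pad_index."""
    clipped = list(token_ids[:max_len])
    # Append padding only when the clipped sequence is shorter than max_len.
    return clipped + [pad_index] * (max_len - len(clipped))


short = pad_or_truncate([5, 6, 7], max_len=5)
long = pad_or_truncate(list(range(1, 13)), max_len=10)
```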
Bases: Module
Bases: Dataset
Bases: Module
Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated using a triplet_selector object that takes embeddings and targets and returns indices of triplets.
Bases: object
Implementations should return indices of anchors, positive and negative samples as an np array of shape [N_triplets x 3].
Bases: TripletSelector
For each positive pair, takes the hardest negative sample (the one with the greatest triplet loss value) to create a triplet. Margin should match the margin used in the triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples and return a negative index for that pair.
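The triplet loss and hardest-negative selection described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: it uses plain Euclidean distance, skips pairs whose hardest negative already has zero loss, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def hardest_negative_triplets(embeddings, labels, margin):
    """Return [anchor, positive, negative] index triples.

    For every same-label pair, the negative with the greatest loss is chosen.
    """
    n = len(embeddings)
    triplets = []
    for a in range(n):
        for p in range(a + 1, n):
            if labels[a] != labels[p]:
                continue
            negatives = [k for k in range(n) if labels[k] != labels[a]]
            if not negatives:
                continue
            losses = [
                triplet_loss(embeddings[a], embeddings[p], embeddings[k], margin)
                for k in negatives
            ]
            best = max(range(len(losses)), key=losses.__getitem__)
            # Assumption: a pair with no violating negative yields no triplet.
            if losses[best] > 0:
                triplets.append([a, p, negatives[best]])
    return triplets


points = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.0], [5.0, 0.0]]
point_labels = [0, 0, 1, 1]
selected = hardest_negative_triplets(points, point_labels, margin=3.0)
```

The margin passed to the selector is the same value used in the loss, which matches the note that the two should agree.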
Word2Vec
Bases: object
The Word2Vec model for social event detection that uses word embeddings to detect events in social media data.
Note
This detector uses word embeddings to identify semantic relationships and detect events in social media data. The model requires a dataset object with a load_data() method.
See [2] for details.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
vector_size (int, optional) – Dimensionality of word vectors. Default: 100.
window (int, optional) – Maximum distance between current and predicted word. Default: 5.
min_count (int, optional) – Minimum word frequency. Default: 1.
sg (int, optional) – Training algorithm: Skip-gram (1) or CBOW (0). Default: 1.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/Word2vec/word2vec_model.model'.
Data preprocessing: tokenization, stop words removal, etc.
Train the Word2Vec model and save it to a file.
Load the Word2Vec model from a file.
Create a document vector by averaging the Word2Vec embeddings of its words.
Detect events by representing each document as the average Word2Vec embedding of its words.
Evaluate the model.
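Representing a document as the average of its word embeddings can be sketched as follows. The embedding table here is a plain dictionary with made-up values, and returning a zero vector for documents with no known words is an assumption of this sketch rather than documented behaviour.

```python
def document_vector(tokens, embeddings, dim):
    """Average the embeddings of the tokens that have a known vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption: fall back to a zero vector when no token is known.
        return [0.0] * dim
    # zip(*vectors) walks the vectors column by column.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Illustrative two-dimensional embeddings (hypothetical values).
toy_embeddings = {"fire": [1.0, 0.0], "flood": [0.0, 1.0]}
averaged = document_vector(["fire", "flood", "unknown"], toy_embeddings, dim=2)
empty = document_vector(["unknown"], toy_embeddings, dim=2)
```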
GloVe
Bases: object
The GloVe model for social event detection that uses GloVe word embeddings to detect events in social media data.
Note
This detector uses word embeddings to identify events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
num_clusters (int, optional) – Number of clusters for KMeans clustering. Default: 50.
random_state (int, optional) – Random seed for reproducibility. Default: 1.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/GloVe/'.
model (str, optional) – Path to pre-trained GloVe word vectors file. Default: '../model/model_needed/glove.6B.100d.txt'.
Load GloVe pre-trained word vectors.
Data preprocessing: tokenization, stop words removal, etc.
Data preprocessing: tokenization, stop words removal, etc.
Convert text to GloVe vector representation.
Create GloVe vectors for each document.
Load the KMeans model from a file.
Assign clusters to each document.
Evaluate the model.
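Pre-trained GloVe text files such as glove.6B.100d.txt store one word per line followed by its space-separated vector components. A minimal parsing sketch is shown below; it reads from an in-memory sample instead of the real file, and the function name load_glove is hypothetical.

```python
import io


def load_glove(handle):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny stand-in for a GloVe file; real files have 50 to 300 components per word.
sample = io.StringIO("storm 0.1 0.2 0.3\nquake -0.4 0.5 0.6\n")
glove = load_glove(sample)
```

With a real file, the same function could be called on `open(path, encoding="utf-8")`.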
WMD
Bases: object
The WMD model for social event detection that uses Word Mover’s Distance to measure document similarity and detect events.
Note
This detector uses word embeddings and Word Mover’s Distance to identify similar documents and detect events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
vector_size (int, optional) – Dimensionality of word vectors. Default: 100.
window (int, optional) – Maximum distance between current and predicted word. Default: 5.
min_count (int, optional) – Minimum word frequency. Default: 1.
sg (int, optional) – Training algorithm: Skip-gram (1) or CBOW (0). Default: 1.
num_best (int, optional) – Number of best matches to return. Default: 5.
threshold (float, optional) – Similarity threshold for event detection. Default: 0.6.
batch_size (int, optional) – Batch size for processing. Default: 1000.
n_workers (int, optional) – Number of worker processes. Default: CPU count - 1.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/WMD/'.
Optimized data preprocessing.
Train the Word2Vec model and save it to a file.
Optimized event detection.
Helper method for saving results.
Evaluate the model and save results.
Bert
Bases: object
The BERT model for social event detection that uses BERT embeddings to detect events in social media data.
Note
This detector uses BERT embeddings to identify events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
model_name (str, optional) – Path to pretrained BERT model or name from HuggingFace. If path doesn’t exist, defaults to ‘bert-base-uncased’. Default: '../model/model_needed/bert-base-uncased'.
max_length (int, optional) – Maximum sequence length for BERT tokenizer. Longer sequences will be truncated. Default: 128.
df (pandas.DataFrame, optional) – Preprocessed dataframe. If None, will be created during preprocessing. Default: None.
train_df (pandas.DataFrame, optional) – Training data split. If None, will be created during model fitting. Default: None.
test_df (pandas.DataFrame, optional) – Test data split. If None, will be created during model fitting. Default: None.
Data preprocessing: tokenization, stop words removal, etc.
Get BERT embeddings for a given text.
Detect events by comparing BERT embeddings.
Evaluate the BERT-based model.
SBert
Bases: object
The SBERT model for social event detection that uses Sentence-BERT for text embedding and event detection.
Note
This detector uses Sentence-BERT to generate text embeddings for identifying events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
model_name (str, optional) – Path or name of the SBERT model to use. Default: '../model/model_needed/paraphrase-MiniLM-L6-v2'.
df (pandas.DataFrame, optional) – Processed dataframe. Default: None.
train_df (pandas.DataFrame, optional) – Training dataframe. Default: None.
test_df (pandas.DataFrame, optional) – Test dataframe. Default: None.
Data preprocessing: tokenization, stop words removal, etc.
Get SBERT embeddings for a given text.
Detect events by comparing SBERT embeddings.
Evaluate the model.
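This reference does not say which measure is used to compare embeddings. Cosine similarity is a common choice for sentence embeddings, so the sketch below uses it as an assumption; the helper names and the toy vectors are hypothetical, and real SBERT vectors have several hundred dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, candidates):
    """Return (index, score) of the candidate closest to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# A test embedding is matched to the nearest training embedding.
index, score = most_similar([1.0, 0.0], [[0.0, 2.0], [3.0, 0.0]])
```

A test document could then inherit the event label of its closest training document.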
EventX
Bases: object
The EventX model for social event detection that extracts events from breaking news using keyword co-occurrence and graph-based clustering.
Note
This detector uses keyword co-occurrence and graph-based clustering to identify events in social media data. The model requires a dataset object with a load_data() method.
See [3] for details.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/eventX/'.
num_repeats (int, optional) – Number of times to repeat keyword extraction. Default: 5.
min_cooccur_time (int, optional) – Minimum number of times keywords must co-occur. Default: 2.
min_prob (float, optional) – Minimum probability threshold for keyword selection. Default: 0.15.
max_kw_num (int, optional) – Maximum number of keywords to extract per document. Default: 3.
Split the dataset into training, validation, and test sets.
Evaluate the model.
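The role of min_cooccur_time can be illustrated with a small keyword co-occurrence count. This is a sketch under assumptions, not the EventX implementation: each document is taken to be a list of already-extracted keywords, and an edge is kept only when a keyword pair appears together in at least min_cooccur_time documents.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents, min_cooccur_time=2):
    """Count keyword pairs per document and keep the frequent ones."""
    counts = Counter()
    for keywords in documents:
        # sorted(set(...)) gives each unordered pair one canonical form.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_cooccur_time}


docs = [
    ["quake", "tokyo", "rescue"],
    ["quake", "tokyo"],
    ["election", "tokyo"],
]
edges = cooccurrence_edges(docs, min_cooccur_time=2)
```

The surviving edges form the keyword graph on which graph-based clustering would then operate.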
CLKD
Bases: object
The CLKD (Contrastive Learning with Knowledge Distillation) model for social event detection.
Note
This detector uses contrastive learning and knowledge distillation to identify events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
n_epochs (int, optional) – Number of training epochs. Default: 1.
n_infer_epochs (int, optional) – Number of inference epochs. Default: 0.
window_size (int, optional) – Size of sliding window for incremental learning. Default: 3.
patience (int, optional) – Number of epochs to wait before early stopping. Default: 5.
margin (float, optional) – Margin for triplet loss. Default: 3.0.
lr (float, optional) – Learning rate. Default: 1e-3.
batch_size (int, optional) – Mini-batch size. Default: 2000.
n_neighbors (int, optional) – Number of neighbors for graph construction. Default: 800.
word_embedding_dim (int, optional) – Dimension of word embeddings. Default: 300.
hidden_dim (int, optional) – Hidden layer dimension. Default: 8.
out_dim (int, optional) – Output dimension. Default: 32.
num_heads (int, optional) – Number of attention heads. Default: 4.
use_residual (bool, optional) – Whether to use residual connections. Default: True.
validation_percent (float, optional) – Percentage of data for validation. Default: 0.1.
test_percent (float, optional) – Percentage of data for testing. Default: 0.2.
use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: False.
metrics (str, optional) – Evaluation metric to use. Default: 'ami'.
use_cuda (bool, optional) – Whether to use GPU acceleration. Default: False.
gpuid (int, optional) – ID of GPU to use. Default: 0.
mask_path (str, optional) – Path to attention mask file. Default: None.
log_interval (int, optional) – Number of steps between logging. Default: 10.
is_incremental (bool, optional) – Whether to use incremental learning. Default: False.
mutual (bool, optional) – Whether to use mutual learning. Default: False.
mode (int, optional) – Training mode. Default: 0.
add_mapping (bool, optional) – Whether to add mapping layer. Default: False.
data_path (str, optional) – Path to data directory. Default: '../model/model_saved/clkd/English'.
file_path (str, optional) – Path to save files. Default: '../model/model_saved/clkd'.
Tmodel_path (str, optional) – Path to teacher model. Default: '../model/model_saved/clkd/English/Tmodel/'.
lang (str, optional) – Language of the data. Default: 'French'.
Tealang (str, optional) – Language of teacher model. Default: 'English'.
t (float, optional) – Temperature parameter. Default: 1.
data_path1 (str, optional) – Path to first language data. Default: '../model/model_saved/clkd/English'.
data_path2 (str, optional) – Path to second language data. Default: '../model/model_saved/clkd/French'.
lang1 (str, optional) – First language. Default: 'English'.
lang2 (str, optional) – Second language. Default: 'French'.
e (float, optional) – Epsilon parameter. Default: 0.
mt (float, optional) – Momentum parameter. Default: 0.5.
rd (float, optional) – Random drop rate. Default: 0.1.
is_static (bool, optional) – Whether to use static embeddings. Default: False.
graph_lang (str, optional) – Language for graph construction. Default: 'English'.
tgtlang (str, optional) – Target language. Default: 'French'.
days (int, optional) – Number of days for temporal window. Default: 7.
initial_lang (str, optional) – Initial language. Default: 'French'.
TransLinear (bool, optional) – Whether to use linear transformation. Default: True.
tgt (str, optional) – Target language code. Default: 'English'.
embpath (str, optional) – Path to embedding file. Default: '../model/model_saved/clkd/dictrans/fr-en-for.npy'.
wordpath (str, optional) – Path to word dictionary. Default: '../model/model_saved/clkd/dictrans/wordsFrench.txt'.
Bases: object
Bases: object
Bases: Metric
Works with classification model
Bases: Metric
Counts average number of nonzero triplets found in minibatches
Bases: Module
Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated using a triplet_selector object that takes embeddings and targets and returns indices of triplets.
Bases: object
Implementations should return indices of anchors, positive and negative samples as an np array of shape [N_triplets x 3].
Bases: TripletSelector
For each positive pair, takes the hardest negative sample (the one with the greatest triplet loss value) to create a triplet. Margin should match the margin used in the triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples and return a negative index for that pair.
Bases: Module
Reinitialize learnable parameters.
Bases: Module
Bases: Module
Bases: object
Bases: Dataset
KPGNN
Bases: object
The KPGNN model for social event detection that uses knowledge-preserving graph neural networks for event detection.
Note
This detector uses graph neural networks with knowledge preservation to identify events in social media data. The model requires a dataset object with a load_data() method.
See [4] for details.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
n_epochs (int, optional) – Number of training epochs. Default: 15.
n_infer_epochs (int, optional) – Number of inference epochs. Default: 0.
window_size (int, optional) – Size of sliding window. Default: 3.
patience (int, optional) – Early stopping patience. Default: 5.
margin (float, optional) – Margin for triplet loss. Default: 3.0.
lr (float, optional) – Learning rate for optimizer. Default: 1e-3.
batch_size (int, optional) – Batch size for training. Default: 200.
n_neighbors (int, optional) – Number of neighbors to sample. Default: 800.
hidden_dim (int, optional) – Hidden layer dimension. Default: 8.
out_dim (int, optional) – Output dimension. Default: 32.
num_heads (int, optional) – Number of attention heads. Default: 4.
use_residual (bool, optional) – Whether to use residual connections. Default: True.
validation_percent (float, optional) – Percentage of data for validation. Default: 0.2.
use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: False.
use_dgi (bool, optional) – Whether to use deep graph infomax. Default: False.
remove_obsolete (int, optional) – Number of epochs before removing obsolete data. Default: 2.
is_incremental (bool, optional) – Whether to use incremental learning. Default: False.
use_cuda (bool, optional) – Whether to use GPU acceleration. Default: False.
data_path (str, optional) – Path to save model data. Default: '../model/model_saved/kpgnn/kpgnn_incremental_test'.
mask_path (str, optional) – Path to mask file. Default: None.
resume_path (str, optional) – Path to resume training from. Default: None.
resume_point (int, optional) – Epoch to resume from. Default: 0.
resume_current (bool, optional) – Whether to resume from current state. Default: True.
log_interval (int, optional) – Number of steps between logging. Default: 10.
Bases: object
Bases: object
This function generates train and validation indices for initial/maintenance epochs and test indices for inference (prediction) epochs.
If remove_obsolete mode is 0 or 1:
For initial/maintenance epochs:
- The first (train_i + 1) blocks (blocks 0, …, train_i) are used as the training set (with explicit labels).
- Randomly sample validation_percent of the training indices as validation indices.
For inference (prediction) epochs:
- The (i + 1)th block (block i) is used as the test set.
Note that the other blocks (blocks train_i + 1, …, i - 1) are also in the graph (without explicit labels; only their features and structural info are leveraged).
If remove_obsolete mode is 2:
For initial/maintenance epochs:
- The (i + 1) = (train_i + 1)th block (block train_i = i) is used as the training set (with explicit labels).
- Randomly sample validation_percent of the training indices as validation indices.
For inference (prediction) epochs:
- The (i + 1)th block (block i) is used as the test set.
length – the length of the label list
data_split – loaded split data (generated in custom_message_graph.py)
train_i, i – flag: train_i == i indicates the initial/maintenance stage, any other combination the inference stage
validation_percent – the percentage of the whole dataset used as validation data
save_path – path to save data
num_indices_to_remove – number of indices that ought to be removed
Returns: train indices, validation indices or test indices
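The block-based split for remove_obsolete modes 0 and 1 can be sketched as below. This is an illustration only: it assumes data_split is a list of per-block message counts, models it with the hypothetical argument block_sizes, and leaves out saving to disk and index removal.

```python
import random


def generate_split_indices(block_sizes, train_i, i, validation_percent, seed=0):
    """Hypothetical sketch of the mode 0/1 split described above."""
    # offsets[b] is the index of the first message in block b.
    offsets = [0]
    for size in block_sizes:
        offsets.append(offsets[-1] + size)

    if train_i == i:
        # Initial/maintenance stage: blocks 0..train_i carry labels.
        labelled = list(range(offsets[train_i + 1]))
        random.Random(seed).shuffle(labelled)
        n_val = int(len(labelled) * validation_percent)
        return sorted(labelled[n_val:]), sorted(labelled[:n_val])

    # Inference stage: block i is the test set.
    return list(range(offsets[i], offsets[i + 1]))


sizes = [4, 3, 5]
train_idx, val_idx = generate_split_indices(sizes, train_i=1, i=1, validation_percent=0.2)
test_idx = generate_split_indices(sizes, train_i=1, i=2, validation_percent=0.2)
```

In mode 2, only block train_i = i would supply the labelled indices instead of all blocks up to train_i.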
Bases: object
Bases: Metric
Works with classification model
Bases: Metric
Counts average number of nonzero triplets found in minibatches
Bases: Module
Reinitialize learnable parameters.
Bases: Module
Bases: Module
Bases: Module
Bases: Module
Bases: Module
Bases: Module
Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated using a triplet_selector object that takes embeddings and targets and returns indices of triplets.
Bases: object
Implementations should return indices of anchors, positive and negative samples as an np array of shape [N_triplets x 3].
Bases: TripletSelector
For each positive pair, takes the hardest negative sample (the one with the greatest triplet loss value) to create a triplet. Margin should match the margin used in the triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples and return a negative index for that pair.
Bases: Dataset
FinEvent
Bases: object
The FinEvent model for social event detection that uses graph neural networks and reinforcement learning for adaptive event detection.
Note
This detector uses graph neural networks and reinforcement learning to identify events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
n_epochs (int, optional) – Number of training epochs. Default: 1.
window_size (int, optional) – Size of sliding window for incremental learning. Default: 3.
patience (int, optional) – Number of epochs to wait before early stopping. Default: 5.
margin (float, optional) – Margin for triplet loss. Default: 3.0.
lr (float, optional) – Learning rate. Default: 1e-3.
batch_size (int, optional) – Mini-batch size. Default: 50.
hidden_dim (int, optional) – Hidden layer dimension. Default: 128.
out_dim (int, optional) – Output dimension. Default: 64.
heads (int, optional) – Number of attention heads. Default: 4.
validation_percent (float, optional) – Percentage of data for validation. Default: 0.2.
use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: False.
is_shared (bool, optional) – Whether to use shared parameters. Default: False.
inter_opt (str, optional) – Integration option for multi-view features. Default: 'cat_w_avg'.
is_initial (bool, optional) – Whether to initialize model. Default: True.
sampler (str, optional) – Type of sampler to use. Default: 'RL_sampler'.
cluster_type (str, optional) – Clustering algorithm to use. Default: 'kmeans'.
threshold_start0 (list, optional) – Initial thresholds for RL-0. Default: [[0.2], [0.2], [0.2]].
RL_step0 (float, optional) – Step size for RL-0. Default: 0.02.
RL_start0 (int, optional) – Starting point for RL-0. Default: 0.
eps_start (float, optional) – Initial epsilon for RL-1. Default: 0.001.
eps_step (float, optional) – Step size for epsilon in RL-1. Default: 0.02.
min_Pts_start (int, optional) – Initial minimum points for RL-1. Default: 2.
min_Pts_step (int, optional) – Step size for minimum points in RL-1. Default: 1.
use_cuda (bool, optional) – Whether to use GPU acceleration. Default: True.
data_path (str, optional) – Path to data directory. Default: '../model/model_saved/finevent/incremental_test/'.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/finevent/'.
mask_path (str, optional) – Path to attention mask file. Default: None.
log_interval (int, optional) – Number of steps between logging. Default: 10.
Bases: FinEvent
i –
data_split –
metrics –
embedding_save_path –
loss_fn –
model –
loss_fn_dgi –
Bases: object
This function generates train and validation indices for initial/maintenance epochs and test indices for inference (prediction) epochs.
If remove_obsolete mode is 0 or 1:
For initial/maintenance epochs:
- The first (train_i + 1) blocks (blocks 0, …, train_i) are used as the training set (with explicit labels).
- Randomly sample validation_percent of the training indices as validation indices.
For inference (prediction) epochs:
- The (i + 1)th block (block i) is used as the test set.
Note that the other blocks (blocks train_i + 1, …, i - 1) are also in the graph (without explicit labels; only their features and structural info are leveraged).
If remove_obsolete mode is 2:
For initial/maintenance epochs:
- The (i + 1) = (train_i + 1)th block (block train_i = i) is used as the training set (with explicit labels).
- Randomly sample validation_percent of the training indices as validation indices.
For inference (prediction) epochs:
- The (i + 1)th block (block i) is used as the test set.
length – the length of the label list
data_split – loaded split data (generated in custom_message_graph.py)
train_i, i – flag: train_i == i indicates the initial/maintenance stage, any other combination the inference stage
validation_percent – the percentage of the whole dataset used as validation data
save_path – path to save data
num_indices_to_remove – number of indices that ought to be removed
Returns: train indices, validation indices or test indices
Bases: object
Bases: object
Bases: Metric
Works with classification model
Bases: Metric
Counts average number of nonzero triplets found in minibatches
Bases: Module
Bases: Module
Adopt this module when using mini-batches.
Bases: Module
Bases: Module
This is used to calculate the similarity between a node and its neighbors in advance, in order to avoid repeated computation.
multi_r_data ([type]) – [description]
features ([type]) – [description]
save_path ([type], optional) – [description]. Defaults to None.
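Computing neighbor similarities once and reusing them can be sketched with a simple cache keyed by node pair. Everything here is an assumption made for illustration: the adjacency-list input, the use of cosine similarity, and the function name do not come from the FinEvent code, and the real multi_r_data structure is not described in this reference.

```python
import math


def precompute_neighbor_similarity(neighbors, features):
    """Return {(u, v): similarity} with each undirected pair computed once."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    cache = {}
    for node, adjacent in neighbors.items():
        for other in adjacent:
            # Order the pair so (0, 1) and (1, 0) share one cache entry.
            key = (min(node, other), max(node, other))
            if key not in cache:
                cache[key] = cosine(features[node], features[other])
    return cache


toy_neighbors = {0: [1, 2], 1: [0], 2: [0]}
toy_features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
similarity = precompute_neighbor_similarity(toy_neighbors, toy_features)
```

Later sampling steps could look up similarity[(u, v)] instead of recomputing it for every batch.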
Bases: Module
Bases: Module
Bases: Module
Online triplet loss. Takes a batch of embeddings and corresponding labels. Triplets are generated using a triplet_selector object that takes embeddings and targets and returns indices of triplets.
Bases: object
Implementations should return indices of anchors, positive and negative samples as an np array of shape [N_triplets x 3].
Bases: TripletSelector
For each positive pair, takes the hardest negative sample (the one with the greatest triplet loss value) to create a triplet. Margin should match the margin used in the triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples and return a negative index for that pair.
QSGNN
Bases: object
The QSGNN model for social event detection that uses a query-based streaming graph neural network for event detection.
Note
This detector uses graph neural networks with query-based streaming to identify events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
finetune_epochs (int, optional) – Number of fine-tuning epochs. Default:
1.n_epochs (int, optional) – Number of training epochs. Default:
5.oldnum (int, optional) – Number of old classes. Default:
20.novelnum (int, optional) – Number of novel classes. Default:
20.
n_infer_epochs (int, optional) – Number of inference epochs. Default: 0.
window_size (int, optional) – Size of sliding window. Default: 3.
patience (int, optional) – Early stopping patience. Default: 5.
margin (float, optional) – Margin for triplet loss. Default: 3.0.
a (float, optional) – Scaling factor. Default: 8.0.
lr (float, optional) – Learning rate for optimizer. Default: 1e-3.
batch_size (int, optional) – Batch size for training. Default: 1000.
n_neighbors (int, optional) – Number of neighbors to sample. Default: 1200.
word_embedding_dim (int, optional) – Word embedding dimension. Default: 300.
hidden_dim (int, optional) – Hidden layer dimension. Default: 16.
out_dim (int, optional) – Output dimension. Default: 64.
num_heads (int, optional) – Number of attention heads. Default: 4.
use_residual (bool, optional) – Whether to use residual connections. Default: True.
validation_percent (float, optional) – Percentage of data for validation. Default: 0.1.
test_percent (float, optional) – Percentage of data for testing. Default: 0.2.
use_hardest_neg (bool, optional) – Whether to use hardest negative mining. Default: True.
metrics (str, optional) – Evaluation metric to use. Default: 'nmi'.
use_cuda (bool, optional) – Whether to use GPU acceleration. Default: True.
add_ort (bool, optional) – Whether to add orthogonal regularization. Default: True.
gpuid (int, optional) – GPU device ID to use. Default: 0.
mask_path (str, optional) – Path to mask file. Default: None.
log_interval (int, optional) – Number of steps between logging. Default: 10.
is_incremental (bool, optional) – Whether to use incremental learning. Default: True.
data_path (str, optional) – Path to save model data. Default: '../model/model_saved/qsgnn/English'.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/qsgnn'.
add_pair (bool, optional) – Whether to add pair-wise constraints. Default: False.
initial_lang (str, optional) – Initial language for processing. Default: 'English'.
is_static (bool, optional) – Whether to use static graph. Default: False.
graph_lang (str, optional) – Language for graph construction. Default: 'English'.
days (int, optional) – Number of days for temporal window. Default: 2.
Bases:
QSGNN
Bases:
Dataset
Bases:
object
Bases:
Module
Bases:
Module
Bases:
Module
Reinitialize learnable parameters.
Bases:
Module
Bases:
Module
Bases:
Module
Online triplet loss. Takes a batch of embeddings and the corresponding labels. Triplets are generated by a triplet_selector object that takes the embeddings and targets and returns the indices of the triplets.
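As a rough illustration of the loss described above, the following sketch computes a margin-based triplet loss over a batch in plain Python. The function names, the squared Euclidean distance, and the list-based inputs are assumptions made for this example; the library's own class operates on tensors and may differ in detail.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(embeddings, triplets, margin=3.0):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) over all triplets.

    embeddings: list of vectors.
    triplets: list of (anchor, positive, negative) index tuples.
    Returns (mean loss, number of triplets with a non-zero loss).
    """
    losses = []
    for a, p, n in triplets:
        d_ap = squared_distance(embeddings[a], embeddings[p])
        d_an = squared_distance(embeddings[a], embeddings[n])
        losses.append(max(d_ap - d_an + margin, 0.0))
    if not losses:
        return 0.0, 0
    return sum(losses) / len(losses), sum(1 for value in losses if value > 0)


embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
triplets = [(0, 1, 2), (0, 1, 3)]
loss, n_active = triplet_loss(embeddings, triplets, margin=3.0)
```

Only the first triplet violates the margin here, so one of the two triplets contributes to the loss; the second count is the quantity a "nonzero triplets" metric would average over minibatches.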
Bases:
object
Implementations should return the indices of the anchor, positive, and negative samples as an np array of shape [N_triplets x 3].
Bases:
TripletSelector
For each positive pair, takes the hardest negative sample (the one with the greatest triplet loss value) to create a triplet. The margin should match the margin used in the triplet loss. negative_selection_fn should take an array of loss_values for a given anchor-positive pair and all negative samples, and return a negative index for that pair.
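The selection rule above can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the helper names are hypothetical, distances are squared Euclidean by assumption, and the result is a list of 3-tuples rather than an np array of shape [N_triplets x 3].

```python
from itertools import combinations


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hardest_negative(loss_values):
    """Index of the largest loss value, or None if no value is positive."""
    best = max(range(len(loss_values)), key=lambda i: loss_values[i])
    return best if loss_values[best] > 0 else None


def select_triplets(embeddings, labels, margin, negative_selection_fn=hardest_negative):
    """Build at most one (anchor, positive, negative) triplet per positive pair."""
    triplets = []
    for label in sorted(set(labels)):
        positives = [i for i, l in enumerate(labels) if l == label]
        negatives = [i for i, l in enumerate(labels) if l != label]
        if len(positives) < 2 or not negatives:
            continue
        for a, p in combinations(positives, 2):
            d_ap = squared_distance(embeddings[a], embeddings[p])
            # Triplet loss of this anchor-positive pair against every negative.
            loss_values = [
                d_ap - squared_distance(embeddings[a], embeddings[n]) + margin
                for n in negatives
            ]
            choice = negative_selection_fn(loss_values)
            if choice is not None:
                triplets.append((a, p, negatives[choice]))
    return triplets


embeddings = [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [10.0, 0.0]]
labels = [0, 0, 1, 1]
triplets = select_triplets(embeddings, labels, margin=1.0)
```

For the pair (0, 1), sample 2 lies closer to the anchor than sample 3, so it yields the larger loss and is chosen as the negative.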
Bases:
object
Bases:
Metric
Works with classification models.
Bases:
Metric
Counts the average number of nonzero triplets found in minibatches.
HCRC
Bases:
object
The HCRC model for social event detection, which uses hierarchical clustering and reinforcement learning for adaptive event detection.
Note
This detector uses hierarchical clustering and reinforcement learning to adaptively detect events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/hcrc/'.
result_path (str, optional) – Path to save results file. Default: '../model/model_saved/hcrc/res.txt'.
task (str, optional) – Task type, e.g. ‘DRL’ for deep reinforcement learning. Default: 'DRL'.
layers (str, optional) – Hidden layer dimensions as string. Default: '[256]'.
N_pred_hid (int, optional) – Node prediction hidden dimension. Default: 64.
G_pred_hid (int, optional) – Graph prediction hidden dimension. Default: 16.
eval_freq (float, optional) – Evaluation frequency. Default: 5.
mad (float, optional) – Moving average decay rate. Default: 0.9.
Glr (float, optional) – Learning rate for graph model. Default: 0.0000006.
Nlr (float, optional) – Learning rate for node model. Default: 0.00001.
Ges (int, optional) – Graph model early stopping patience. Default: 50.
Nes (int, optional) – Node model early stopping patience. Default: 2000.
Gepochs (int, optional) – Number of graph model training epochs. Default: 105.
Nepochs (int, optional) – Number of node model training epochs. Default: 100.
device (int, optional) – GPU device ID to use. Default: 0.
Bases:
object
Bases:
object
Bases:
Module
Bases:
object
Bases:
embedder
Bases:
Module
Bases:
embedder
Bases:
Module
Bases:
object
Bases:
object
Bases:
object
Bases:
object
Bases:
Module
Bases:
Module
Bases:
Module
Bases:
Module
Bases:
object
Clear the buffer.
Add a transition to the buffer.
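The two buffer operations above (clearing and adding a transition) might look like the following minimal sketch. The class name, the transition fields, the fixed capacity, and the random sampling method are all assumptions for illustration; the reference does not state how HCRC stores or consumes its transitions.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition layout; the fields stored by the library may differ.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-capacity buffer of transitions for reinforcement learning."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        """Add a transition to the buffer."""
        self.storage.append(Transition(state, action, reward, next_state, done))

    def clear(self):
        """Clear the buffer."""
        self.storage.clear()

    def sample(self, batch_size, rng=random):
        """Draw a random mini-batch without replacement."""
        return rng.sample(list(self.storage), min(batch_size, len(self.storage)))

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
```

With a capacity of 3, only the three most recent of the five transitions remain in the buffer.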
UCLSED
Bases:
object
The UCLSED model for social event detection, which uses uncertainty-aware contrastive learning.
dataset (*) – The dataset object containing social media data. The dataset should provide two methods: load_data(), which returns the raw data, and get_dataset_language(), which returns the language of the dataset.
file_path (*) – Path to save model files. (default: ‘../model/model_saved/uclsed/’)
epoch (*) – Number of training epochs. (default: 50)
batch_size (*) – Batch size for training. (default: 128)
neighbours_num (*) – Number of neighbors to sample. (default: 80)
GNN_h_dim (*) – Hidden dimension of GNN. (default: 256)
GNN_out_dim (*) – Output dimension of GNN. (default: 256)
E_h_dim (*) – Hidden dimension of encoder. (default: 128)
use_uncertainty (*) – Whether to use uncertainty estimation. (default: True)
use_cuda (*) – Whether to use GPU acceleration. (default: True)
gpuid (*) – GPU device ID to use. (default: 0)
mode (*) – Training mode. (default: 0)
mse (*) – Whether to use MSE loss. (default: False)
digamma (*) – Whether to use digamma function. (default: True)
log (*) – Whether to use log transformation. (default: False)
learning_rate (*) – Learning rate for optimizer. (default: 1e-4)
weight_decay (*) – Weight decay for optimizer. (default: 1e-5)
Bases:
object
Bases:
Module
Bases:
Module
Bases:
Module
Bases:
Module
alpha – All Dirichlet distribution parameters.
Combined Dirichlet distribution parameters.
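One common way to combine Dirichlet parameters from two sources in evidential deep learning is the reduced Dempster-Shafer rule, sketched below for a single sample. Whether UCLSED uses exactly this rule is an assumption; the function name `combine_dirichlet` and the list-based inputs are hypothetical.

```python
def combine_dirichlet(alpha1, alpha2):
    """Fuse two Dirichlet parameter vectors with the reduced Dempster-Shafer rule."""
    k = len(alpha1)

    def opinion(alpha):
        strength = sum(alpha)                            # Dirichlet strength S
        belief = [(a - 1.0) / strength for a in alpha]   # b_k = (alpha_k - 1) / S
        uncertainty = k / strength                       # u = K / S
        return belief, uncertainty

    b1, u1 = opinion(alpha1)
    b2, u2 = opinion(alpha2)
    # Conflict: belief mass that the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in range(k) for j in range(k) if i != j)
    scale = 1.0 - conflict
    belief = [(b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1) / scale for i in range(k)]
    uncertainty = u1 * u2 / scale
    strength = k / uncertainty
    # Map the fused opinion back to Dirichlet parameters: alpha_k = b_k * S + 1.
    alpha = [b * strength + 1.0 for b in belief]
    return alpha, belief, uncertainty


alpha, belief, uncertainty = combine_dirichlet([2.0, 1.0, 1.0], [2.0, 1.0, 1.0])
```

When both sources support the same class, the fused uncertainty is lower than either input's, and the fused beliefs plus uncertainty still sum to one. More than two sources can be combined by applying the function pairwise.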
RPLMSED
Bases:
object
The RPLMSED model for social event detection, which uses pre-trained language models with prompt learning.
Note
This detector uses prompt learning with pre-trained language models to identify events in social media data. The model requires a dataset object with a load_data() method.
dataset (object) – The dataset object containing social media data. Must provide load_data() method that returns the raw data.
plm_path (str, optional) – Path to pre-trained language model. Default: '../model/model_needed/base_plm_model/roberta-large'.
file_path (str, optional) – Path to save model files. Default: '../model/model_saved/rplmsed/'.
plm_tuning (bool, optional) – Whether to fine-tune PLM. Default: False.
use_ctx_att (bool, optional) – Whether to use context attention. Default: False.
offline (bool, optional) – Whether to use offline mode. Default: True.
ctx_att_head_num (int, optional) – Number of context attention heads. Default: 2.
pmt_feats (tuple, optional) – Prompt feature indices to use. Default: (0,1,2,4).
batch_size (int, optional) – Batch size for training. Default: 128.
lmda1 (float, optional) – Lambda 1 hyperparameter. Default: 0.010.
lmda2 (float, optional) – Lambda 2 hyperparameter. Default: 0.005.
tao (float, optional) – Temperature parameter. Default: 0.90.
optimizer (str, optional) – Optimizer to use. Default: 'Adam'.
lr (float, optional) – Learning rate. Default: 2e-5.
weight_decay (float, optional) – Weight decay for optimizer. Default: 1e-5.
momentum (float, optional) – Momentum for optimizer. Default: 0.9.
step_lr_gamma (float, optional) – Learning rate decay factor. Default: 0.98.
max_epochs (int, optional) – Maximum training epochs. Default: 1.
ckpt_path (str, optional) – Path to save checkpoints. Default: '../model/model_saved/rplmsed/ckpt/'.
eva_data (str, optional) – Path to evaluation data. Default: '../model/model_saved/rplmsed/Eva_data/'.
early_stop_patience (int, optional) – Early stopping patience. Default: 2.
early_stop_monitor (str, optional) – Metric to monitor for early stopping. Default: 'loss'.
SAMPLE_NUM_TWEET (int, optional) – Number of tweets to sample. Default: 60.
WINDOW_SIZE (int, optional) – Size of sliding window. Default: 3.
device (str, optional) – Device to use for computation. Default: "cuda:0" if available else "cpu".
Bases:
tuple
Alias for field number 7
Alias for field number 5
Alias for field number 2
Alias for field number 4
Alias for field number 9
Alias for field number 1
Alias for field number 0
Alias for field number 8
Alias for field number 6
Alias for field number 10
Alias for field number 3
Bases:
object
Bases:
object
Pad or truncate a sequence to a fixed length.
seq – input sequence
max_len – maximum length
pad – padding token ID
pad_left – whether to pad on the left
Returns the padded sequence.
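A minimal sketch of such a pad-or-truncate helper is shown below. The function name `pad_seq`, the default values, and the choice to keep the leading items when truncating are assumptions; only the parameter names come from the description above.

```python
def pad_seq(seq, max_len, pad=0, pad_left=False):
    """Pad or truncate `seq` so that the result has exactly `max_len` items."""
    seq = list(seq)[:max_len]               # assumption: truncation keeps the leading items
    padding = [pad] * (max_len - len(seq))  # empty when the sequence is already long enough
    return padding + seq if pad_left else seq + padding


right = pad_seq([5, 6, 7], max_len=5)
left = pad_seq([5, 6, 7], max_len=5, pad_left=True)
short = pad_seq([5, 6, 7], max_len=2)
```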
Bases:
Module
An implementation of the paper "A Structured Self-Attentive Sentence Embedding".
Initializes the parameters suggested in the paper.
feat_dim (int) – hidden dimension for the LSTM
hid_dim (int) – hidden dimension for the dense layer
att_head_num (int) – number of attention hops (attention heads)
self
Exception –
inpt – [len, bsz, dim]
mask – [len, bsz]
[bsz, head_num, dim], [bsz, head_num, len]
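To make the input and output shapes above concrete, here is a dependency-free sketch of structured self-attention, A = softmax(W2 tanh(W1 Hᵀ)) followed by M = A H. The function name, the weight arguments `w1` and `w2`, and the mask convention (1 for a real token, 0 for padding) are assumptions for this illustration; the library module works on tensors.

```python
import math
import random


def structured_self_attention(inpt, mask, w1, w2):
    """inpt: [len][bsz][dim], mask: [len][bsz] with 1 = real token (assumed convention).

    w1: [hid][dim], w2: [head_num][hid].
    Returns ([bsz][head_num][dim], [bsz][head_num][len]).
    Assumes every sequence has at least one unmasked step.
    """
    seq_len, bsz, dim = len(inpt), len(inpt[0]), len(inpt[0][0])
    outputs, weights = [], []
    for b in range(bsz):
        valid = [mask[t][b] for t in range(seq_len)]
        # Hidden projection of every time step: tanh(w1 @ x_t).
        hidden = [
            [math.tanh(sum(w * x for w, x in zip(row, inpt[t][b]))) for row in w1]
            for t in range(seq_len)
        ]
        sample_out, sample_att = [], []
        for head in w2:
            scores = [sum(w * h for w, h in zip(head, hidden[t])) for t in range(seq_len)]
            # Masked, numerically stable softmax over time steps.
            top = max(s for s, v in zip(scores, valid) if v)
            exps = [math.exp(s - top) if v else 0.0 for s, v in zip(scores, valid)]
            total = sum(exps)
            att = [e / total for e in exps]
            sample_att.append(att)
            # Attention-weighted sum of the inputs for this head.
            sample_out.append(
                [sum(att[t] * inpt[t][b][d] for t in range(seq_len)) for d in range(dim)]
            )
        outputs.append(sample_out)
        weights.append(sample_att)
    return outputs, weights


rng = random.Random(0)
seq_len, bsz, dim, hid, head_num = 4, 2, 3, 5, 2
inpt = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(bsz)] for _ in range(seq_len)]
mask = [[1, 1], [1, 1], [1, 0], [1, 0]]  # the second sequence has two padded steps
w1 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hid)]
w2 = [[rng.uniform(-1, 1) for _ in range(hid)] for _ in range(head_num)]
out, att = structured_self_attention(inpt, mask, w1, w2)
```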
Bases:
Module
HISEvent
Bases:
object
HISEvent class for event detection.
This class implements hierarchical structure-based event detection.
dataset – Input dataset
... –
Evaluate the model.
Bases:
object
Initialize the preprocessor.
dataset – Dataset class (e.g. Event2012, Event2018, etc.)
language – Language of the dataset (default ‘English’)
mode – ‘open’ or ‘close’ (default ‘close’); determines the preprocessing mode
Get closed set test dataframe
Get SBERT embeddings for closed set messages
Get SBERT embeddings for open set messages
Split data into open set blocks
Main preprocessing function
Splits the DataFrame into training, validation, and test sets, and saves the indices (masks) as .pt files.
df (pd.DataFrame) – The DataFrame to be split
save_dir (str) – Directory to save the masks
train_size (float) – Proportion for training (default 0.7)
val_size (float) – Proportion for validation (default 0.1)
test_size (float) – Proportion for testing (default 0.2)
random_seed (int) – Random seed for reproducibility
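The index split described above can be sketched with the standard library alone. The function name `split_indices` and the default seed of 42 are assumptions, and this sketch returns the index lists instead of writing .pt files; the library method additionally saves the masks to save_dir.

```python
import random


def split_indices(n_rows, train_size=0.7, val_size=0.1, test_size=0.2, random_seed=42):
    """Shuffle row indices and cut them into train, validation, and test lists."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-8:
        raise ValueError("train_size, val_size and test_size must sum to 1")
    indices = list(range(n_rows))
    random.Random(random_seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_rows * train_size)
    n_val = round(n_rows * val_size)
    return {
        "train": sorted(indices[:n_train]),
        "val": sorted(indices[n_train:n_train + n_val]),
        "test": sorted(indices[n_train + n_val:]),  # the remainder absorbs rounding
    }


masks = split_indices(100)
```

Calling the function twice with the same seed yields identical splits, which is the purpose of the random_seed parameter.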
Get subgraph edges.
clusters – a list containing the current clusters, each cluster is a list of nodes of the original graph
graph_splits – a list of (start_index, end_index) pairs, each (start_index, end_index) pair indicates a subset of clusters, which will serve as the nodes of a new subgraph
weighted_global_edges – a list of (start node, end node, edge weight) tuples, each tuple is an edge in the original graph
a list containing the edges of all subgraphs
all_subgraphs_edges
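The inputs and output described above suggest a function along the following lines. This is a sketch under two stated assumptions: end_index is treated as exclusive, and an edge belongs to a subgraph only when both of its endpoints fall inside that subgraph's clusters.

```python
def get_subgraphs_edges(clusters, graph_splits, weighted_global_edges):
    """Collect, for each split, the global edges whose endpoints both lie in that split."""
    all_subgraphs_edges = []
    for start, end in graph_splits:
        # Nodes of this subgraph: every node of clusters[start:end].
        nodes = {node for cluster in clusters[start:end] for node in cluster}
        edges = [(u, v, w) for u, v, w in weighted_global_edges if u in nodes and v in nodes]
        all_subgraphs_edges.append(edges)
    return all_subgraphs_edges


clusters = [[1, 2], [3], [4, 5], [6]]
graph_splits = [(0, 2), (2, 4)]
weighted_global_edges = [(1, 2, 0.9), (2, 3, 0.5), (3, 4, 0.4), (4, 6, 0.7), (5, 6, 0.2)]
all_subgraphs_edges = get_subgraphs_edges(clusters, graph_splits, weighted_global_edges)
```

The edge (3, 4, 0.4) crosses the two splits, so it appears in neither subgraph.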
hierarchical 2D SE minimization
Bases:
object
get the volume of the graph
get the 1D SE of the graph
get the updated 1D SE after new edges are inserted into the graph
get the sum of the degrees of the cut edges of community comm
get the volume of community comm
get the 2D SE of the graph
calculate the volume, cut, community node SE, and leaf nodes SE of each community, then store them into self.struc_data
calculate the volume, cut, community node SE, and leaf nodes SE after merging each pair of communities, then store them into self.struc_data_2d
initialize self.division such that each node is assigned to its own community
add any isolated nodes into graph
greedily update the encoding tree to minimize 2D SE
vanilla (greedy) 2D SE minimization
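The quantities listed above (volume, cut, 1D SE, 2D SE) can be illustrated with a small sketch that uses the standard structural entropy definitions on an undirected weighted edge list. The function names and the edge-list and dict-based partition formats are assumptions; the class itself stores these values in self.struc_data and updates them incrementally.

```python
import math
from collections import defaultdict


def degrees(edges):
    """Weighted degree of every node in an undirected edge list [(u, v, w), ...]."""
    deg = defaultdict(float)
    for u, v, w in edges:
        deg[u] += w
        deg[v] += w
    return deg


def se_1d(edges):
    """One-dimensional SE: -sum over nodes of (d_v / vol) * log2(d_v / vol)."""
    deg = degrees(edges)
    vol = sum(deg.values())  # graph volume = sum of all degrees
    return -sum(d / vol * math.log2(d / vol) for d in deg.values() if d > 0)


def se_2d(edges, division):
    """Two-dimensional SE of a partition given as {community: [nodes]}."""
    deg = degrees(edges)
    vol = sum(deg.values())
    community_of = {v: c for c, nodes in division.items() for v in nodes}
    total = 0.0
    for c, nodes in division.items():
        comm_vol = sum(deg[v] for v in nodes)
        if comm_vol == 0:
            continue  # a community of isolated nodes contributes nothing
        # Cut: total weight of edges with exactly one endpoint in the community.
        cut = sum(w for u, v, w in edges if (community_of[u] == c) != (community_of[v] == c))
        total -= cut / vol * math.log2(comm_vol / vol)  # community node term
        total -= sum(
            deg[v] / vol * math.log2(deg[v] / comm_vol) for v in nodes if deg[v] > 0
        )  # leaf node terms
    return total


# Two triangles joined by one bridge edge, all weights equal to 1.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
singletons = {v: [v] for v in range(6)}
triangles = {"a": [0, 1, 2], "b": [3, 4, 5]}
```

With every node in its own community the 2D SE equals the 1D SE, which is the starting point of the greedy minimization; grouping the two triangles lowers it.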
Replaces “@user” with “”
Removes unicode strings like “\u002c” and “x96”
Replaces url address with “url”
Replaces repetitions of exclamation marks
Replaces repetitions of question marks
Removes emoticons from text
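Several of the text-cleaning helpers above can be approximated with regular expressions, as in the sketch below. The function names, the patterns, and the choice to collapse repeated punctuation into a single mark are assumptions for illustration; the library's helpers may use different patterns or replacement tokens.

```python
import re


def replace_at_user(text):
    """Replace "@user" mentions with an empty string."""
    return re.sub(r"@\w+", "", text)


def replace_url(text):
    """Replace URL addresses with the literal token "url"."""
    return re.sub(r"(https?://\S+)|(www\.\S+)", "url", text)


def replace_multi_exclamation(text, token="!"):
    """Collapse runs of two or more exclamation marks into `token`."""
    return re.sub(r"!{2,}", token, text)


def replace_multi_question(text, token="?"):
    """Collapse runs of two or more question marks into `token`."""
    return re.sub(r"\?{2,}", token, text)


raw = "@alice look at https://example.com now!!! really???"
cleaned = replace_multi_question(
    replace_multi_exclamation(replace_url(replace_at_user(raw)))
)
```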
Use Sentence-BERT to embed sentences.
s_list – a list of sentences/tokens to be embedded
language – the language of the sentences (‘English’, ‘French’, ‘Arabic’)
Returns the embeddings of the sentences/tokens.
ADPSEMEvent
Bases:
object
ADPSEMEvent class for event detection.
This class implements adaptive semantic event detection.
dataset – Input dataset
... –
Evaluate the model.
Bases:
object
Initialize the preprocessor.
dataset – Dataset class (e.g. Event2012, Event2018, etc.)
language – Language of the dataset (default ‘English’)
mode – ‘open’ or ‘close’ (default ‘close’); determines the preprocessing mode
Get closed set test dataframe
Get SBERT embeddings for closed set messages
Get SBERT embeddings for open set messages
Split data into open set blocks
Main preprocessing function
Splits the DataFrame into training, validation, and test sets, and saves the indices (masks) as .pt files.
df (pd.DataFrame) – The DataFrame to be split
save_dir (str) – Directory to save the masks
train_size (float) – Proportion for training (default 0.7)
val_size (float) – Proportion for validation (default 0.1)
test_size (float) – Proportion for testing (default 0.2)
random_seed (int) – Random seed for reproducibility
Get subgraph edges.
clusters – a list containing the current clusters, each cluster is a list of nodes of the original graph
graph_splits – a list of (start_index, end_index) pairs, each (start_index, end_index) pair indicates a subset of clusters, which will serve as the nodes of a new subgraph
weighted_global_edges – a list of (start node, end node, edge weight) tuples, each tuple is an edge in the original graph
a list containing the edges of all subgraphs
all_subgraphs_edges
hierarchical 2D SE minimization
Bases:
object
get the volume of the graph
get the 1D SE of the graph
get the updated 1D SE after new edges are inserted into the graph
get the sum of the degrees of the cut edges of community comm
get the volume of community comm
get the 2D SE of the graph
calculate the volume, cut, community node SE, and leaf nodes SE of each community, then store them into self.struc_data
calculate the volume, cut, community node SE, and leaf nodes SE after merging each pair of communities, then store them into self.struc_data_2d
initialize self.division such that each node is assigned to its own community
add any isolated nodes into graph
greedily update the encoding tree to minimize 2D SE
vanilla (greedy) 2D SE minimization
Replaces “@user” with “”
Removes unicode strings like “\u002c” and “x96”
Replaces url address with “url”
Replaces repetitions of exclamation marks
Replaces repetitions of question marks
Removes emoticons from text
Use Sentence-BERT to embed sentences.
s_list – a list of sentences/tokens to be embedded
language – the language of the sentences (‘English’, ‘French’, ‘Arabic’)
Returns the embeddings of the sentences/tokens.
Hypersed