SocialED.utils

utility

A set of utility functions to support social event detection tasks.

SocialED.utils.utility.construct_graph(df, G=None)
    Construct a graph from a DataFrame containing social media data.

    Parameters:
        df (pandas.DataFrame) – DataFrame containing social media data with columns: tweet_id, user_mentions, user_id, entities, sampled_words
        G (networkx.Graph, optional (default=None)) – Existing graph to add nodes/edges to. If None, a new graph is created.
    Returns: G – Graph with nodes for tweets, users, entities and words, and edges between them.
    Return type: networkx.Graph

SocialED.utils.utility.tokenize_text(text, max_length=512)
    Tokenize text for social event detection tasks.

    Parameters:
        text (str) – The input text to tokenize.
        max_length (int, optional (default=512)) – Maximum length of the tokenized sequence.
    Returns: tokens – List of tokenized words/subwords.
    Return type: list

SocialED.utils.utility.pprint(params, offset=0, printer=repr)
    Pretty print the dictionary ‘params’.

    Parameters:
        params (dict) – The dictionary to pretty print.
        offset (int, optional (default=0)) – The offset at the beginning of each line.
        printer (callable, optional (default=repr)) – The function used to convert entries to strings.
    Returns: Pretty-printed string representation.
    Return type: str

SocialED.utils.utility.validate_device(gpu_id)
    Check that the given GPU ID is valid in the current environment. If no GPU is available, return ‘cpu’.

    Parameters:
        gpu_id (int) – GPU ID to check.
    Returns: device – Valid device, e.g., ‘cuda:0’ or ‘cpu’.
    Return type: str

SocialED.utils.utility.check_parameter(value, lower, upper, param_name, include_left=True, include_right=True)
    Check whether a parameter value is within the specified bounds.

    Parameters:
        value (int or float) – The parameter value to check.
        lower (int or float) – Lower bound.
        upper (int or float) – Upper bound.
        param_name (str) – Name of the parameter, used in error messages.
        include_left (bool, optional (default=True)) – Whether to include the lower bound in the valid range.
        include_right (bool, optional (default=True)) – Whether to include the upper bound in the valid range.
    Returns: True if the parameter is valid; raises ValueError otherwise.
    Return type: bool

SocialED.utils.utility.currentTime()
    Get the current time as a formatted string.

    Returns: Current time in the format ‘YYYY-MM-DD HH:MM:SS’.
    Return type: str

SocialED.utils.utility.sim(z1, z2)
    Compute cosine similarity between two sets of vectors.

    Parameters:
        z1 (torch.Tensor) – First set of vectors.
        z2 (torch.Tensor) – Second set of vectors.
    Returns: Similarity matrix.
    Return type: torch.Tensor

SocialED.utils.utility.pairwise_sample(embeddings, labels=None, model=None)

SocialED.utils.utility.SBERT_embed(s_list, language)
    Use Sentence-BERT to embed sentences.

    Parameters:
        s_list – A list of sentences/tokens to be embedded.
        language – The language of the sentences (‘English’, ‘French’, ‘Arabic’).
    Returns: The embeddings of the sentences/tokens.

SocialED.utils.utility.DS_Combin(alpha, classes)
    Parameters:
        alpha – All Dirichlet distribution parameters.
    Returns: Combined Dirichlet distribution parameters.

SocialED.utils.utility.graph_statistics(G, save_path)
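For `construct_graph`, the following is a minimal sketch of the kind of heterogeneous graph the description mentions. It uses plain dictionaries so that it runs without pandas or networkx. The name `build_graph_sketch`, the `t_`/`u_`/`w_` node prefixes, and the rule that every tweet is linked to its users, entities and words are assumptions made for illustration; this reference does not confirm how the library names nodes or chooses edges.

```python
def build_graph_sketch(rows, graph=None):
    """Link each tweet node to its user, entity and word nodes.

    `rows` is a list of dicts whose keys mirror the documented DataFrame
    columns. `graph` is an adjacency mapping {node: set(neighbors)}; a new
    one is created when it is None (mirroring the documented G=None default).
    """
    if graph is None:
        graph = {}
    for row in rows:
        tweet = f"t_{row['tweet_id']}"  # hypothetical prefix for tweet nodes
        users = [f"u_{u}" for u in [row["user_id"], *row["user_mentions"]]]
        entities = [str(e) for e in row["entities"]]
        words = [f"w_{w}" for w in row["sampled_words"]]
        graph.setdefault(tweet, set())  # keep tweets with no neighbors as nodes
        for neighbor in users + entities + words:
            # Undirected edge: record it in both directions.
            graph[tweet].add(neighbor)
            graph.setdefault(neighbor, set()).add(tweet)
    return graph


rows = [
    {"tweet_id": 1, "user_id": 10, "user_mentions": [11],
     "entities": ["Paris"], "sampled_words": ["flood"]},
    {"tweet_id": 2, "user_id": 11, "user_mentions": [],
     "entities": ["Paris"], "sampled_words": ["rain"]},
]
graph = build_graph_sketch(rows)
print(sorted(graph["Paris"]))  # tweets that share the entity "Paris"
```

Passing an existing mapping as `graph` extends it in place, which is the role the optional `G` argument plays in the documented signature.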
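For `validate_device`, the device check can be sketched roughly as follows. The name `validate_device_sketch` is hypothetical, and falling back to ‘cpu’ for an out-of-range ID is an assumption: this reference only states what happens when no GPU is present. The PyTorch import is guarded so the block also runs on machines without PyTorch.

```python
def validate_device_sketch(gpu_id):
    """Return 'cuda:<gpu_id>' if that GPU is usable, otherwise 'cpu'."""
    try:
        import torch  # optional dependency for this sketch
    except ImportError:
        return "cpu"
    # Assumption: an ID outside the visible device range falls back to CPU.
    if torch.cuda.is_available() and 0 <= gpu_id < torch.cuda.device_count():
        return f"cuda:{gpu_id}"
    return "cpu"


print(validate_device_sketch(0))
```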
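For `check_parameter`, the effect of `include_left` and `include_right` is easiest to see in a small stand-alone sketch. The function name `check_parameter_sketch` and the wording of the error message are assumptions; only the parameters and the return-or-raise contract come from the reference.

```python
def check_parameter_sketch(value, lower, upper, param_name,
                           include_left=True, include_right=True):
    """Return True if value lies in the requested interval, else raise."""
    above_lower = value >= lower if include_left else value > lower
    below_upper = value <= upper if include_right else value < upper
    if not (above_lower and below_upper):
        left = "[" if include_left else "("
        right = "]" if include_right else ")"
        raise ValueError(
            f"{param_name}={value!r} is outside the range "
            f"{left}{lower}, {upper}{right}"
        )
    return True


print(check_parameter_sketch(0.5, 0, 1, "ratio"))
try:
    # The upper bound is excluded here, so 1 is rejected.
    check_parameter_sketch(1, 0, 1, "ratio", include_right=False)
except ValueError as err:
    print(err)
```

With both flags left at True the interval is closed, so the bounds themselves are accepted.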
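For `currentTime`, the documented ‘YYYY-MM-DD HH:MM:SS’ format corresponds to the `strftime` pattern used in this sketch. The helper name `current_time_sketch` is hypothetical, and using local time rather than UTC is an assumption.

```python
from datetime import datetime


def current_time_sketch():
    """Return the local time formatted as 'YYYY-MM-DD HH:MM:SS'."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")


print(current_time_sketch())
```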
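For `sim`, the similarity matrix has one row per vector in `z1` and one column per vector in `z2`, with entry (i, j) equal to the cosine of the angle between `z1[i]` and `z2[j]`. The sketch below shows that computation on plain Python lists so it runs without PyTorch. The name `cosine_similarity_matrix` is hypothetical, and the documented function operates on `torch.Tensor` inputs instead.

```python
import math


def cosine_similarity_matrix(z1, z2):
    """Return S with S[i][j] = cos(z1[i], z2[j]) for lists of vectors."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]  # assumes v is not the zero vector

    z1n = [normalize(v) for v in z1]
    z2n = [normalize(v) for v in z2]
    # The dot product of unit vectors is their cosine similarity.
    return [[sum(a * b for a, b in zip(u, v)) for v in z2n] for u in z1n]


s = cosine_similarity_matrix([[1.0, 0.0], [1.0, 1.0]],
                             [[1.0, 0.0], [0.0, 2.0]])
for row in s:
    print([round(x, 4) for x in row])
```

Normalizing first means the scale of the input vectors does not affect the result: `[0.0, 2.0]` behaves exactly like `[0.0, 1.0]`.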