SocialED.dataprocess

dataprocess

Data processing utilities for multilingual social media data.

SocialED.utils.dataprocess.construct_graph(df, G=None)

Construct a graph from a DataFrame containing social media data.

Parameters:

df (pandas.DataFrame) – DataFrame containing social media data with columns: tweet_id, user_mentions, user_id, entities, sampled_words
G (networkx.Graph, optional (default=None)) – Existing graph to add nodes and edges to. If None, a new graph is created.

Returns:

G – Graph with nodes for tweets, users, entities and words, and edges between them.

Return type:

networkx.Graph
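
As a rough illustration of the structure described above, the following dependency-free sketch turns rows shaped like the documented columns into typed nodes and edges. It uses plain Python sets instead of networkx. The node prefixes (`t_`, `u_`, `e_`, `w_`), the choice to link every node to its tweet, and the helper name `sketch_graph` are assumptions made for illustration, not confirmed behavior of construct_graph.

```python
def sketch_graph(rows, nodes=None, edges=None):
    """Add tweet, user, entity and word nodes from `rows` to a simple graph.

    `rows` is a list of dicts keyed by the documented column names.
    Passing existing `nodes`/`edges` mimics extending an existing graph.
    """
    nodes = set() if nodes is None else nodes
    edges = set() if edges is None else edges
    for row in rows:
        tweet = "t_" + str(row["tweet_id"])  # prefix is an assumption
        # Collect every node associated with this tweet.
        related = ["u_" + str(row["user_id"])]
        related += ["u_" + str(u) for u in row["user_mentions"]]
        related += ["e_" + str(e) for e in row["entities"]]
        related += ["w_" + str(w) for w in row["sampled_words"]]
        nodes.add(tweet)
        for other in related:
            nodes.add(other)
            # Store undirected edges as sorted tuples so (a, b) == (b, a).
            edges.add(tuple(sorted((tweet, other))))
    return nodes, edges


rows = [
    {"tweet_id": 1, "user_id": 10, "user_mentions": [11],
     "entities": ["Paris"], "sampled_words": ["flood"]},
    {"tweet_id": 2, "user_id": 11, "user_mentions": [],
     "entities": ["Paris"], "sampled_words": ["rain"]},
]
nodes, edges = sketch_graph(rows)
```

In this sketch the two tweets end up connected indirectly through the shared user and entity nodes, which is the kind of linkage a heterogeneous tweet graph is meant to capture.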

SocialED.utils.dataprocess.load_data(name, cache_dir=None)

Load a dataset by downloading its .npy files from the SocialED_datasets repository.

Parameters:

name (str) – The name of the dataset.
cache_dir (str, optional) – The directory for dataset caching. Default: None.

Returns:

data – The loaded dataset.

Return type:

numpy.ndarray
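
The caching behavior implied by cache_dir can be pictured with a minimal cache-then-fetch sketch. Everything here is hypothetical: the helper `load_with_cache`, the injected `fetch` callable that stands in for the real download, the fallback cache location, and the `<name>.npy` file naming. It returns raw bytes rather than a numpy.ndarray so that it runs without third-party packages.

```python
import os
import tempfile


def load_with_cache(name, fetch, cache_dir=None):
    """Return the bytes for dataset `name`, fetching only on a cache miss.

    `fetch` is a hypothetical callable standing in for the real download.
    """
    if cache_dir is None:
        # Assumed fallback location; the library's default may differ.
        cache_dir = os.path.join(tempfile.gettempdir(), "dataset_cache_sketch")
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name + ".npy")
    if not os.path.exists(path):
        # Cache miss: fetch once and store the result on disk.
        with open(path, "wb") as f:
            f.write(fetch(name))
    with open(path, "rb") as f:
        return f.read()


calls = []


def fake_fetch(name):
    calls.append(name)  # record each simulated download
    return b"payload for " + name.encode()


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache("demo", fake_fetch, cache_dir=tmp)
    second = load_with_cache("demo", fake_fetch, cache_dir=tmp)
```

The second call is served from disk, so the simulated download runs only once.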

SocialED.utils.dataprocess.graph_statistics(G, save_path)

Calculate and save basic statistics of a graph.

Parameters:

G – Graph to compute statistics for.
save_path – Location where the statistics are saved.

Returns:

num_isolated_nodes – Number of isolated nodes in the graph.

Return type:

int
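
The returned value counts nodes that have no edges at all. A small sketch of that count, using plain lists in place of a networkx graph (the helper name `count_isolated_nodes` and the sample node names are illustrative assumptions):

```python
def count_isolated_nodes(nodes, edges):
    """Count nodes that do not appear in any edge."""
    connected = set()
    for a, b in edges:
        connected.add(a)
        connected.add(b)
    return sum(1 for n in nodes if n not in connected)


nodes = ["t_1", "t_2", "u_10", "w_rain"]
edges = [("t_1", "u_10")]
num_isolated_nodes = count_isolated_nodes(nodes, edges)
```

For a networkx graph, the same quantity is available as `networkx.number_of_isolates(G)`.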

SocialED.utils.dataprocess.extract_time_feature(t_str)

Extract time features from timestamp string.
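
This page does not state which features are extracted. One plausible encoding is a two-element vector: the number of days elapsed since a fixed reference date, scaled down, and the fraction of the day that has elapsed. The sketch below assumes that encoding, an ISO 8601 input string, the reference date 1899-12-30, and the scaling constant 100000. None of these is confirmed here, and the name `time_feature_sketch` is hypothetical.

```python
from datetime import datetime

REFERENCE = datetime(1899, 12, 30)  # assumed reference date


def time_feature_sketch(t_str):
    """Map an ISO 8601 timestamp string to [scaled days, fraction of day]."""
    t = datetime.fromisoformat(str(t_str))
    delta = t - REFERENCE
    # delta.days is the whole-day count; delta.seconds is the remainder within the day.
    return [delta.days / 100000.0, delta.seconds / 86400.0]


features = time_feature_sketch("2012-10-10 12:00:00")
```

Under these assumptions, a timestamp at noon yields a second component of 0.5.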

SocialED.utils.dataprocess.get_word2id_emb(wordpath, embpath)

Load word-to-id mapping and embeddings from files.

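Parameters:

wordpath – Path to the file containing the word-to-id mapping.
embpath – Path to the file containing the embeddings.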

Returns:

(word2id dictionary, embeddings array).

Return type:

tuple

SocialED.utils.dataprocess.df_to_t_features(df)

Convert DataFrame timestamps to time features.

Parameters:

df (pandas.DataFrame) – DataFrame with a ‘created_at’ column containing timestamps.

Returns:

Array of time features for each timestamp.

Return type:

numpy.ndarray

SocialED.utils.dataprocess.check_class_sizes(ground_truths, predictions)

Check sizes of predicted classes against ground truth classes.

Parameters:

ground_truths – Ground truth class labels.
predictions – Predicted class labels.

Returns:

List of predicted class labels whose classes are larger than the average ground truth class size.

Return type:

list
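
The comparison described above can be sketched with `collections.Counter`. This is a minimal sketch, assuming that the average is the number of samples divided by the number of distinct ground truth classes and that "larger" means strictly greater. The helper name `oversized_predictions` is hypothetical.

```python
from collections import Counter


def oversized_predictions(ground_truths, predictions):
    """Return predicted labels whose class size exceeds the average true class size."""
    true_counts = Counter(ground_truths)
    # Average size = number of samples / number of distinct true classes.
    avg_true_size = len(ground_truths) / len(true_counts)
    pred_counts = Counter(predictions)
    return [label for label, size in pred_counts.items() if size > avg_true_size]


ground_truths = [0, 0, 1, 1, 2, 2]
predictions = [7, 7, 7, 7, 8, 9]
large = oversized_predictions(ground_truths, predictions)
```

Here the average ground truth class holds two samples, so only predicted class 7, with four samples, is reported.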