SocialED.dataprocess

dataprocess

Data processing utilities for multilingual social media data.

SocialED.utils.dataprocess.construct_graph(df, G=None)

Construct a graph from a DataFrame containing social media data.

Parameters:
  • df (pandas.DataFrame) – DataFrame containing social media data with columns: tweet_id, user_mentions, user_id, entities, sampled_words

  • G (networkx.Graph, optional) – Existing graph to add nodes and edges to. If None, a new graph is created. Default: None.

Returns:

G – Graph with nodes for tweets, users, entities and words, and edges between them.

Return type:

networkx.Graph
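A minimal sketch of how such a graph could be assembled, assuming that user_mentions, entities and sampled_words hold lists. The function name construct_graph_sketch, the t_/u_/e_/w_ node prefixes and the kind node attribute are illustrative assumptions and are not taken from the library.

```python
import networkx as nx
import pandas as pd


def construct_graph_sketch(df, G=None):
    """Illustrative stand-in; not the library's implementation."""
    if G is None:
        G = nx.Graph()
    for _, row in df.iterrows():
        # Prefixes keep the four node types from colliding (an assumption).
        tweet = f"t_{row['tweet_id']}"
        users = [f"u_{u}" for u in list(row["user_mentions"]) + [row["user_id"]]]
        entities = [f"e_{e}" for e in row["entities"]]
        words = [f"w_{w}" for w in row["sampled_words"]]

        G.add_node(tweet, kind="tweet")
        G.add_nodes_from(users, kind="user")
        G.add_nodes_from(entities, kind="entity")
        G.add_nodes_from(words, kind="word")
        # Connect the tweet to every user, entity and word it involves.
        G.add_edges_from((tweet, other) for other in users + entities + words)
    return G


# Toy data with the documented columns.
df = pd.DataFrame({
    "tweet_id": [1, 2],
    "user_mentions": [[10], []],
    "user_id": [11, 10],
    "entities": [["Paris"], ["Paris"]],
    "sampled_words": [["flood", "rain"], ["rain"]],
})
G = construct_graph_sketch(df)
```

In this sketch, two tweets become connected whenever they share a user, an entity or a word, which is why passing an existing G lets a graph grow across several DataFrames.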

SocialED.utils.dataprocess.load_data(name, cache_dir=None)

Load a dataset by downloading its .npy files from the SocialED_datasets repository.

Parameters:
  • name (str) – The name of the dataset.

  • cache_dir (str, optional) – The directory for dataset caching. Default: None.

Returns:

data – The loaded dataset.

Return type:

numpy.ndarray

SocialED.utils.dataprocess.graph_statistics(G, save_path)

Calculate and save basic statistics of a graph.

Parameters:
  • G (networkx.Graph) – The input graph to analyze.

  • save_path (str) – Directory path to save the statistics.

Returns:

num_isolated_nodes – Number of isolated nodes in the graph.

Return type:

int
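The following sketch shows one way to compute and save such statistics with networkx. Which statistics the library records and the output file name (graph_statistics.txt here) are not stated in this reference, so both are assumptions, as is the name graph_statistics_sketch.

```python
import os
import tempfile

import networkx as nx


def graph_statistics_sketch(G, save_path):
    """Illustrative stand-in; not the library's implementation."""
    num_isolated_nodes = nx.number_of_isolates(G)
    lines = [
        f"nodes: {G.number_of_nodes()}",
        f"edges: {G.number_of_edges()}",
        f"isolated nodes: {num_isolated_nodes}",
    ]
    # save_path is treated as a directory, as in the parameter description.
    os.makedirs(save_path, exist_ok=True)
    with open(os.path.join(save_path, "graph_statistics.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
    return num_isolated_nodes


G_demo = nx.Graph()
G_demo.add_edge("t_1", "u_7")   # one connected pair
G_demo.add_node("w_unused")     # one node with no edges

with tempfile.TemporaryDirectory() as tmp:
    isolated = graph_statistics_sketch(G_demo, tmp)
    report_exists = os.path.exists(os.path.join(tmp, "graph_statistics.txt"))
```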

SocialED.utils.dataprocess.extract_time_feature(t_str)

Extract time features from a timestamp string.

Parameters:

t_str (str) – Timestamp string in ISO format.

Returns:

List containing two normalized time features: [days, seconds].

Return type:

list
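This reference does not say how the two features are normalized. The sketch below shows one plausible scheme: the day count is measured from a fixed reference date and divided by a constant, and the seconds elapsed within the day are divided by 86400. The reference date, the day divisor and the name time_features_sketch are all assumptions.

```python
from datetime import datetime

# Assumed reference date and scaling constant; neither is documented.
REFERENCE_DATE = datetime(1899, 12, 30)
DAY_SCALE = 100000.0
SECONDS_PER_DAY = 86400.0


def time_features_sketch(t_str):
    """Illustrative stand-in; not the library's implementation."""
    delta = datetime.fromisoformat(str(t_str)) - REFERENCE_DATE
    days = delta.days / DAY_SCALE             # whole days since the reference date
    seconds = delta.seconds / SECONDS_PER_DAY  # fraction of the current day
    return [days, seconds]


features = time_features_sketch("2012-10-10T06:00:00")
```

Under these assumptions the second feature always lies in [0, 1), and a timestamp at 06:00 maps to 0.25.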

SocialED.utils.dataprocess.get_word2id_emb(wordpath, embpath)

Load word-to-id mapping and embeddings from files.

Parameters:
  • wordpath (str) – Path to file containing words.

  • embpath (str) – Path to file containing embeddings.

Returns:

(word2id dictionary, embeddings array).

Return type:

tuple
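The file formats are not specified here. As a sketch, assume the word file lists one word per line, with the line index serving as the id, and the embedding file is a NumPy .npy array whose rows follow the same order. The function name load_word2id_emb_sketch and the file names are hypothetical.

```python
import os
import tempfile

import numpy as np


def load_word2id_emb_sketch(wordpath, embpath):
    """Illustrative stand-in; assumes one word per line and a .npy file."""
    with open(wordpath, encoding="utf-8") as f:
        word2id = {line.strip(): i for i, line in enumerate(f)}
    embeddings = np.load(embpath)
    return word2id, embeddings


# Build two tiny files so the sketch can be run end to end.
with tempfile.TemporaryDirectory() as tmp:
    wordpath = os.path.join(tmp, "words.txt")
    embpath = os.path.join(tmp, "embeddings.npy")
    with open(wordpath, "w", encoding="utf-8") as f:
        f.write("flood\nrain\n")
    np.save(embpath, np.zeros((2, 3)))
    word2id, embeddings = load_word2id_emb_sketch(wordpath, embpath)
```

With this layout, embeddings[word2id["rain"]] would select the vector for "rain".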

SocialED.utils.dataprocess.df_to_t_features(df)

Convert DataFrame timestamps to time features.

Parameters:

df (pandas.DataFrame) – DataFrame with ‘created_at’ column containing timestamps.

Returns:

Array of time features for each timestamp.

Return type:

numpy.ndarray
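As a rough illustration, the conversion can be viewed as applying a per-timestamp feature function to every value in the created_at column and stacking the results into a 2-D array. The helper _time_feature below is a hypothetical stand-in for extract_time_feature, and its reference date and scaling constants are assumptions.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def _time_feature(t_str):
    # Hypothetical normalization: days since an assumed reference date,
    # plus the fraction of the day that has elapsed.
    delta = datetime.fromisoformat(str(t_str)) - datetime(1899, 12, 30)
    return [delta.days / 100000.0, delta.seconds / 86400.0]


def df_to_t_features_sketch(df):
    """Illustrative stand-in; returns one feature row per timestamp."""
    return np.asarray([_time_feature(t) for t in df["created_at"]])


df_times = pd.DataFrame(
    {"created_at": ["2012-10-10T06:00:00", "2012-10-11T18:00:00"]}
)
t_features = df_to_t_features_sketch(df_times)
```

The result has shape (number of rows, 2), so it can be concatenated column-wise with other per-tweet feature matrices.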

SocialED.utils.dataprocess.check_class_sizes(ground_truths, predictions)

Compare the sizes of the predicted classes with the average size of the ground truth classes.

Parameters:
  • ground_truths (array-like) – Ground truth class labels.

  • predictions (array-like) – Predicted class labels.

Returns:

List of predicted class labels whose class size exceeds the average ground truth class size.

Return type:

list
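A short sketch of this check using only the standard library. The name oversized_predicted_classes is hypothetical, and the strict "greater than" comparison is an assumption based on the description above.

```python
from collections import Counter
from statistics import mean


def oversized_predicted_classes(ground_truths, predictions):
    """Illustrative stand-in; not the library's implementation."""
    # Average number of samples per ground truth class.
    avg_true_size = mean(Counter(ground_truths).values())
    # Keep predicted labels whose sample count exceeds that average.
    pred_counts = Counter(predictions)
    return [label for label, count in pred_counts.items() if count > avg_true_size]


ground_truths = [0, 0, 1, 1, 2, 2]   # three classes of size 2, average 2
predictions = [0, 0, 0, 0, 1, 2]     # predicted class 0 has 4 members
large_classes = oversized_predicted_classes(ground_truths, predictions)
```

Here only predicted class 0 is returned, since its four members exceed the average ground truth class size of two.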