SocialED.dataprocess

dataprocess

Data processing utilities for multilingual social media data.

SocialED.utils.dataprocess.construct_graph(df, G=None)
    Construct a graph from a DataFrame containing social media data.

    Parameters:
        df (pandas.DataFrame) – DataFrame containing social media data with columns: tweet_id, user_mentions, user_id, entities, sampled_words
        G (networkx.Graph, optional (default=None)) – Existing graph to add nodes/edges to. If None, creates new graph.
    Returns:
        G – Graph with nodes for tweets, users, entities and words, and edges between them.
    Return type:
        networkx.Graph

SocialED.utils.dataprocess.load_data(name, cache_dir=None)
    Data loading function that downloads .npy files from SocialED_datasets repository.

    Parameters:
        name (str) – The name of the dataset.
        cache_dir (str, optional) – The directory for dataset caching. Default: None.
    Returns:
        data – The loaded dataset.
    Return type:
        numpy.ndarray

SocialED.utils.dataprocess.graph_statistics(G, save_path)
    Calculate and save basic statistics of a graph.

    Parameters:
        G (networkx.Graph) – The input graph to analyze.
        save_path (str) – Directory path to save the statistics.
    Returns:
        num_isolated_nodes – Number of isolated nodes in the graph.
    Return type:
        int

SocialED.utils.dataprocess.extract_time_feature(t_str)
    Extract time features from timestamp string.

    Parameters:
        t_str (str) – Timestamp string in ISO format.
    Returns:
        List containing two normalized time features: [days, seconds].
    Return type:
        list

SocialED.utils.dataprocess.get_word2id_emb(wordpath, embpath)
    Load word-to-id mapping and embeddings from files.

    Parameters:
        wordpath (str) – Path to file containing words.
        embpath (str) – Path to file containing embeddings.
    Returns:
        (word2id dictionary, embeddings array).
    Return type:
        tuple

SocialED.utils.dataprocess.df_to_t_features(df)
    Convert DataFrame timestamps to time features.

    Parameters:
        df (pandas.DataFrame) – DataFrame with ‘created_at’ column containing timestamps.
    Returns:
        Array of time features for each timestamp.
    Return type:
        numpy.ndarray

SocialED.utils.dataprocess.check_class_sizes(ground_truths, predictions)
    Check sizes of predicted classes against ground truth classes.

    Parameters:
        ground_truths (array-like) – Ground truth class labels.
        predictions (array-like) – Predicted class labels.
    Returns:
        List of predicted class labels that are larger than average ground truth class size.
    Return type:
        list
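The node and edge layout that construct_graph describes (tweets linked to users, entities and words) can be illustrated with a small dependency-free sketch. This is not the library's implementation: it uses a plain adjacency dictionary instead of networkx, the function name build_hetero_graph is hypothetical, and the t_/u_/e_/w_ node prefixes and the choice to connect each tweet to its author, its mentions, its entities and its words are assumptions made for illustration.

```python
from collections import defaultdict


def build_hetero_graph(rows, adj=None):
    """Add tweet, user, entity and word nodes from `rows` to an adjacency map."""
    # Reuse an existing graph when one is passed in, mirroring the optional G argument.
    if adj is None:
        adj = defaultdict(set)
    for row in rows:
        tweet = f"t_{row['tweet_id']}"
        adj[tweet]  # touching the key registers the tweet node even without edges
        users = [row["user_id"], *row["user_mentions"]]
        # Assumed prefixes keep the four node types from colliding.
        neighbours = (
            [f"u_{u}" for u in users]
            + [f"e_{e}" for e in row["entities"]]
            + [f"w_{w}" for w in row["sampled_words"]]
        )
        for node in neighbours:
            # Undirected edge between the tweet and each related node.
            adj[tweet].add(node)
            adj[node].add(tweet)
    return adj


rows = [
    {"tweet_id": 1, "user_id": 10, "user_mentions": [11],
     "entities": ["NASA"], "sampled_words": ["launch"]},
    {"tweet_id": 2, "user_id": 11, "user_mentions": [],
     "entities": ["NASA"], "sampled_words": ["orbit"]},
]
graph = build_hetero_graph(rows)
```

In this toy input the two tweets are not linked directly, but both are linked to the shared entity node and to user 11, which is the kind of indirect connection a heterogeneous tweet graph is meant to expose.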
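As a rough illustration of what the two normalized features returned by extract_time_feature might look like, the sketch below splits an ISO timestamp into a whole-day count and a seconds-within-day count and scales each. The function name time_features_sketch, the reference date and both scale factors are assumptions; the documentation above only states that the result is [days, seconds] in normalized form.

```python
from datetime import datetime

# Assumed constants: the library may use a different reference date or scaling.
REFERENCE_DATE = datetime(1899, 12, 30)
DAY_SCALE = 100000.0
SECONDS_PER_DAY = 86400.0


def time_features_sketch(t_str):
    """Return [scaled day count, fraction of the day elapsed] for an ISO timestamp."""
    t = datetime.fromisoformat(str(t_str))
    delta = t - REFERENCE_DATE
    # timedelta stores whole days and the leftover seconds separately.
    return [delta.days / DAY_SCALE, delta.seconds / SECONDS_PER_DAY]


features = time_features_sketch("2012-10-10T12:00:00")
```

With these assumed constants, noon maps to 0.5 in the second feature, and the first feature stays between 0 and 1 for any date within roughly 270 years of the reference date.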
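The comparison that check_class_sizes describes can be sketched with the standard library alone. The function name oversized_predicted_classes is hypothetical, and the sketch assumes that "average ground truth class size" means the number of ground truth labels divided by the number of distinct ground truth classes; the library may define or round it differently.

```python
from collections import Counter


def oversized_predicted_classes(ground_truths, predictions):
    """Return predicted labels whose class is larger than the mean ground truth class."""
    truth_sizes = Counter(ground_truths)
    # Assumed definition: total samples divided by number of distinct true classes.
    # Raises ZeroDivisionError if ground_truths is empty.
    average_size = len(ground_truths) / len(truth_sizes)
    predicted_sizes = Counter(predictions)
    return [label for label, size in predicted_sizes.items() if size > average_size]


large = oversized_predicted_classes([0, 0, 1, 1, 2, 2], [0, 0, 0, 0, 1, 2])
```

Here the three ground truth classes have two members each, so only predicted class 0, with four members, is reported. A check like this is one way to spot predicted clusters that may have absorbed several true events.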