
Static Curriculum Learning Using Popularity Labels #228

Open
rezaBarzgar opened this issue Jan 3, 2024 · 3 comments
Labels: curriculum, experiment (Running a study or baseline for results)

@rezaBarzgar (Member)

To define a static difficulty measurer for the task of neural team formation, we can use the popularity labels for each team. Assuming that we have a popularity label for each team, we can use torch.utils.data.SubsetRandomSampler to customize the proportion of popular and non-popular teams in each batch. There are two possible approaches to applying CL to this task:

  • Batch-based: change the proportion of popular and non-popular teams every k batches within each epoch. Each epoch starts with batches that contain more popular (easy) teams and fewer non-popular teams, and ends with batches that contain fewer popular and more non-popular (hard) teams. In general, every epoch consists of batches of varying difficulty.

  • Epoch-based (more common): the difficulty level changes across epochs, not within individual batches. In the early epochs, more popular (easy) examples are presented to the model; as training progresses, more non-popular (challenging) examples are introduced, encouraging the model to generalize and learn more complex patterns. A sketch of this variant follows the list.
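To make the epoch-based variant concrete, here is a minimal sketch of how the per-epoch sampler could be built with torch.utils.data.SubsetRandomSampler, assuming we already have a boolean team_is_popular array (the team-level labels discussed below). The helper name make_epoch_loader, the linear schedule, and the dataset object are illustrative assumptions, not a fixed design:

import numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler


def make_epoch_loader(dataset, team_is_popular, epoch, n_epochs, batch_size=128):
    # Popular teams are treated as easy, non-popular teams as hard.
    easy_idx = np.where(team_is_popular)[0]
    hard_idx = np.where(~team_is_popular)[0]
    # The fraction of hard teams grows linearly from 0 (first epoch) to 1 (last epoch).
    hard_fraction = epoch / max(n_epochs - 1, 1)
    n_hard = int(hard_fraction * len(hard_idx))
    indices = np.concatenate([easy_idx, np.random.choice(hard_idx, n_hard, replace=False)])
    # SubsetRandomSampler shuffles within the selected subset on every epoch.
    sampler = SubsetRandomSampler(indices.tolist())
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


# Usage: rebuild the loader at the start of every epoch.
# for epoch in range(n_epochs):
#     loader = make_epoch_loader(train_dataset, team_is_popular, epoch, n_epochs)
#     for batch in loader:
#         ...  # training step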

Currently, we only have popularity labels for each individual expert, not for teams. One possible solution is to assign a popularity label to a team based on the number of popular/non-popular experts in it; for example, a team in which the majority of experts are popular can be considered a popular team.

@hosseinfani, since the epoch-based approach is more common in the CL literature, I'm starting with it. I'm posting this here to confirm the team popularity labeling and the static CL approach with you.

@hosseinfani (Member)

@rezaBarzgar "a team with a majority of popular experts" >> you need to specify what "majority" means, i.e., 60%, ..., 90%, 100% of a team? Also it may depend on domain. Like in a paper, a team with 1-2 popular authors out of 4-5 authors (teams' average size), in movies, a popular movie's casncrow are all (90-100%) popular.

Anyways, you need to specify a reasonable percentage and see the results.

@hosseinfani added the "experiment" (Running a study or baseline for results) label on Jan 4, 2024
@rezaBarzgar (Member, Author) commented Jan 4, 2024

I calculated popularity labels for each team based on the proportion of popular experts in the team for imdb. If the proportion of popular experts in a team is greater than the specified threshold, the team is labelled as popular; otherwise, it is labelled as non-popular.

Here is the code (I'll also push with my next updates):

import pickle

import numpy as np
import pandas as pd


def label_generator(vecs_path, expert_popularity_label_path, proportion):
    # Load the sparse team-member matrix (lil_matrix: one row per team).
    with open(vecs_path + '/teamsvecs.pkl', 'rb') as file:
        teamsvecs = pickle.load(file)
    # One popularity label per expert, indexed by memberidx.
    experts_popularity_label = pd.read_csv(expert_popularity_label_path, index_col='memberidx').to_numpy().squeeze()
    team_popularity_label = []
    for team in teamsvecs['member']:
        experts = team.rows[0]  # column indices of the experts in this team
        populars_count = experts_popularity_label[experts].sum()
        # A team is popular if its fraction of popular experts exceeds the threshold.
        team_popularity_label.append((populars_count / len(experts)) > proportion)

    team_popularity_label = np.array(team_popularity_label)
    print(f'percentage of popular teams: {(team_popularity_label.sum() / len(team_popularity_label)) * 100}')
    return team_popularity_label


if __name__ == '__main__':
    vecs_pth = './data/preprocessed/imdb/title.basics.tsv.filtered.mt75.ts3'
    expert_popularity_label_pth = './data/preprocessed/imdb/popularity.imdb.mt75.csv'
    for proportion in [0.1, 0.3, 0.5, 0.7, 0.9]:
        print(f'proportion: {proportion}')
        label_generator(vecs_pth, expert_popularity_label_pth, proportion)

Here are the results for different proportions:

proportion: 0.1 → percentage of popular teams: 86.4
proportion: 0.3 → percentage of popular teams: 82.7
proportion: 0.5 → percentage of popular teams: 66.8
proportion: 0.7 → percentage of popular teams: 52.7
proportion: 0.8 → percentage of popular teams: 42.8
proportion: 0.9 → percentage of popular teams: 40.7

@hosseinfani (Member)

So go ahead with 0.7, but schedule the runs for all other proportions; also include 0.0 and 1.0 for testing purposes.
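A minimal sketch of that schedule, reusing label_generator from the previous comment (the exact list of proportions is an assumption):

# Sketch: sweep all proportions, including the 0.0 and 1.0 sanity checks.
for proportion in [0.0, 0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 1.0]:
    print(f'proportion: {proportion}')
    label_generator(vecs_pth, expert_popularity_label_pth, proportion)

Note that if the strict > comparison is kept, proportion 1.0 labels every team as non-popular and 0.0 labels any team with at least one popular expert as popular, which should make the two endpoints easy to verify.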
