Nexus: fix job option usage at user level #5255

Open: wants to merge 3 commits into base: develop

jtkrogel
Contributor

@jtkrogel jtkrogel commented Dec 16, 2024

Proposed changes

This PR improves job option handling in Nexus, making the interface often used at the Machine level available at the user level.

For example, to achieve what is desired in #5240, the following input can be used:

settings(
    pseudo_dir = './pseudopotentials',
    results    = '',
    sleep      = 3,
    machine    = 'ws16',
    )

...

scf = generate_pwscf(
    job = job(cores=16,app='pw.x',run_options=dict(bind_to='--bind-to none')),
    ...
    )

...

This results in a call to mpirun like:

      Executing:  
        export OMP_NUM_THREADS=1
        mpirun --bind-to none -np 16 pw.x -input scf.in 
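
As a rough illustration of where each piece of this command comes from, the sketch below assembles the same line from the job settings. `build_mpirun_command` is a hypothetical helper written for this example, not a Nexus function, and the argument ordering is simply copied from the output above.

```python
def build_mpirun_command(cores, app, run_options="", app_options=""):
    """Assemble an mpirun command line (hypothetical helper, not Nexus code)."""
    parts = ["mpirun"]
    if run_options:
        parts.append(run_options)        # launcher flags, e.g. '--bind-to none'
    parts += ["-np", str(cores), app]    # process count, then the executable
    if app_options:
        parts.append(app_options)        # arguments handed to the application
    return " ".join(parts)


command = build_mpirun_command(16, "pw.x", "--bind-to none", "-input scf.in")
print(command)
```

With the values from the example above, this prints `mpirun --bind-to none -np 16 pw.x -input scf.in`.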

These forms are now equivalent:

job(...,run_options=dict(bind_to='--bind-to none'))
job(...,run_options=['--bind-to none'])
job(...,run_options='--bind-to none')
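
One way to picture why the three forms can be treated alike is a small normalization step that reduces each of them to the same list of flags. The sketch below is an assumption about how such a step could look. `normalize_run_options` is a hypothetical name, and this is not the code added by the PR.

```python
def normalize_run_options(run_options):
    """Reduce the dict, list, and string forms to one list of flag strings.

    Hypothetical sketch only; not taken from the Nexus source.
    """
    if run_options is None:
        return []
    if isinstance(run_options, str):
        return [run_options]                 # a single flag string
    if isinstance(run_options, dict):
        return list(run_options.values())    # keys are labels, values are flags
    return list(run_options)                 # already a sequence of flags


forms = [
    dict(bind_to="--bind-to none"),
    ["--bind-to none"],
    "--bind-to none",
]
normalized = [normalize_run_options(f) for f in forms]
print(normalized)
```

Each of the three inputs reduces to `['--bind-to none']` in this sketch, which is the sense in which the forms are interchangeable.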

All current tests pass.

What type(s) of changes does this code introduce?

  • Bugfix
  • New feature

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Laptop

Checklist

  • Yes. This PR is up to date with the current state of 'develop'

@prckent
Contributor

prckent commented Dec 16, 2024

Thanks Jaron.

  1. Is it possible to specify these options at a higher level still, as well as at the low level? Likely every job run on a specific machine and set of software will need the same options, with perhaps a couple of exceptions. We don't want users to have to modify their Nexus scripts on a per-machine basis except at the top.
  2. Can you update the Nexus documentation to describe these options?

@jtkrogel
Contributor Author

Option data (and other data) can be shared at the top of the Nexus file. This is how large scripts, and scripts meant to run on multiple machines, are often handled.

Here is an example of three pwscf jobs across two machines that share software and runtime/submission options:

settings(
    pseudo_dir = './pseudopotentials',
    results    = '',
    sleep      = 3,
    machine    = 'ws128',
    )

if settings.machine=='ws128':
    # jobs for 128 core workstation
    scf_opts1 = obj(app         = 'pw.x',
                    run_options = '--bind-to none')
    scf_opts2 = obj(app         = 'pw.x',
                    run_options = '--bind-to none',
                    app_options = '-nk 8')
    scf_job1 = job(cores= 64,**scf_opts1)
    scf_job2 = job(cores= 64,**scf_opts2)
    scf_job3 = job(cores=128,**scf_opts2)

elif settings.machine=='inti':
    # jobs for "Inti" cluster
    qe_presub = '''
module purge
module load mpi/openmpi-x86_64  
module load qe/quantum-espresso 
'''
    scf_opts1 = obj(nodes       = 1,
                    hours       = 1,
                    app         = 'pw.x',
                    run_options = '--bind-to none',
                    presub      = qe_presub)
    scf_opts2 = obj(nodes       = 1,
                    hours       = 1,
                    app         = 'pw.x',
                    run_options = '--bind-to none',
                    app_options = '-nk 8',
                    presub      = qe_presub)
    scf_job1 = job(processes_per_node=64,**scf_opts1)
    scf_job2 = job(processes_per_node=64,**scf_opts2)
    scf_job3 = job(**scf_opts2)

else:
    print('machine unknown!')
    exit()
#end if
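
The per-machine branches above repeat most of their options, so the same idea can also be sketched as a lookup table kept at the top of the script. This is an illustrative alternative using plain Python dicts rather than Nexus `obj` instances. `MACHINE_OPTS` and `shared_job_options` are hypothetical names introduced only for this example.

```python
QE_PRESUB = """
module purge
module load mpi/openmpi-x86_64
module load qe/quantum-espresso
"""

# Options shared by every pw.x job on a given machine (values from the script above).
MACHINE_OPTS = {
    "ws128": dict(app="pw.x", run_options="--bind-to none"),
    "inti": dict(nodes=1, hours=1, app="pw.x",
                 run_options="--bind-to none", presub=QE_PRESUB),
}


def shared_job_options(machine, **overrides):
    """Return the shared options for `machine`, with per-job overrides applied."""
    try:
        base = MACHINE_OPTS[machine]
    except KeyError:
        raise ValueError(f"machine unknown: {machine!r}") from None
    return {**base, **overrides}   # new dict; the table itself is not modified


scf_opts1 = shared_job_options("ws128")
scf_opts2 = shared_job_options("ws128", app_options="-nk 8")
print(scf_opts2)
```

The resulting dicts can be expanded with `**` in the same way as in the script above, for example `job(cores=64, **scf_opts1)`, so only the table at the top needs to change when a machine is added.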

@ye-luo
Contributor

ye-luo commented Dec 16, 2024

Test this please

@ye-luo ye-luo enabled auto-merge December 16, 2024 21:32
@ye-luo
Contributor

ye-luo commented Dec 16, 2024

Does this PR fix the CI failure in #5240?

@jtkrogel
Contributor Author

This PR should not cause the CI to fail. Nexus behaves in the same way as before (and all tests pass).

Instead, it should open the way to passing flags to mpirun as needed for each machine environment.

@prckent prckent disabled auto-merge December 16, 2024 21:53
@prckent
Contributor

prckent commented Dec 16, 2024

I would like to request that documentation be added.

I am concerned about the rate of growth of Nexus functionality relative to the growth of the documentation.

An example like the above plus suitable discussion would be a good addition imo, and also improve discoverability of the functionality.
