1.3 A Basic Workflow

Please note: The following notebook requires that you first run signac_101_Getting_Started.

This part of the tutorial requires NumPy.

Operations

For this part of the tutorial we will imagine that we are still not convinced of the pressure-volume relation we just “discovered” and that calculating the volume is actually a very expensive procedure, such as a many-particle simulation with HOOMD-blue.

We emulate this by adding an optional cost argument to our volume calculation function:

[1]:
from time import sleep


def V_idg(N, p, kT, cost=0):
    sleep(cost)  # Emulate an expensive calculation.
    return N * kT / p  # Ideal gas law.
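
For example, with cost=1 the call below sleeps for one second before returning the ideal gas volume, N kT / p = 1000.0:

V_idg(N=1000, p=1.0, kT=1.0, cost=1)  # returns 1000.0 after about one second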

It is useful to think of each modification of the workspace, including the addition, modification, and removal of data, in terms of an operation.

An operation should take only one(!) argument: the job handle.

Any additional arguments may represent hidden state point parameters, which would lead to a loss of provenance and possibly render our data space inconsistent.
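
As a hypothetical counter-example (not part of this tutorial’s workflow), the following sketch hides kT from the state point:

def compute_volume_bad(job, kT):  # BAD: kT is a hidden parameter.
    # The recorded state point no longer fully determines the result,
    # so the provenance of the stored value is lost.
    job.document["V"] = job.sp.N * kT / job.sp.p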

The following function is an example of a proper operation:

[2]:
def compute_volume(job):
    print("Computing volume of", job)
    V = V_idg(cost=1, **job.statepoint())
    # Store the result in the job document ...
    job.document["V"] = V
    # ... and in a file within the job's workspace.
    with open(job.fn("V.txt"), "w") as file:
        file.write(str(V) + "\n")

This operation computes the volume solely based on the state point parameters and stores the results such that they are clearly associated with the job, i.e., in the job document and in a file within the job’s workspace.

Please note that the only reason for storing the same result in two different ways is for demonstration purposes.

Execution

To execute our first data space operation, we simply loop through our project’s data space:

[3]:
import signac

project = signac.get_project("projects/tutorial")

for job in project:
    compute_volume(job)
Computing volume of 5a6c687f7655319db24de59a2336eff8
Computing volume of 5a456c131b0c5897804a4af8e77df5aa
Computing volume of ee617ad585a90809947709a7a45dda9a
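
As a quick sanity check, the stored result can be read back from both locations. The following sketch assumes the state point with p=1.0 from the getting-started notebook:

job = project.open_job({"p": 1.0, "kT": 1.0, "N": 1000})
print(job.document["V"])  # result from the job document
with open(job.fn("V.txt")) as file:
    print(file.read().strip())  # result from the workspace file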

Data Space Initialization

Since our operation is now more expensive, it is a good idea to split initialization and execution. Let’s initialize a few more state points in one go:

[4]:
import numpy as np
import signac

project = signac.get_project("projects/tutorial")


def init_statepoints(n):
    for p in np.linspace(0.1, 10.0, n):
        sp = {"p": float(p), "kT": 1.0, "N": 1000}
        job = project.open_job(sp)
        job.init()
        print("initialize", job)


init_statepoints(5)
initialize 5a6c687f7655319db24de59a2336eff8
initialize d03270cdbbae73c8bb1d9fa0ab370264
initialize 973e29d6a4ed6cf7329c03c77df7f645
initialize 4cf2795722061df825ec9a4d5e31e494
initialize 5a456c131b0c5897804a4af8e77df5aa

We see that initializing more jobs and even reinitializing old jobs is no problem. However, since our calculation will be “expensive”, we would want to skip the computation whenever the result is already available.

One possibility is to add a simple check before executing the computation:

[5]:
for job in project:
    if "V" not in job.document:
        compute_volume(job)
Computing volume of 4cf2795722061df825ec9a4d5e31e494
Computing volume of 973e29d6a4ed6cf7329c03c77df7f645
Computing volume of d03270cdbbae73c8bb1d9fa0ab370264

Classification

It would be even better if we could get an overview of which state points have been calculated and which have not. We call this a project’s status.

Before we continue, let’s initialize a few more state points.

[6]:
init_statepoints(10)
initialize 5a6c687f7655319db24de59a2336eff8
initialize 22582e83c6b12336526ed304d4378ff8
initialize c0ab2e09a6f878019a6057175bf718e6
initialize 9110d0837ad93ff6b4013bae30091edd
initialize b45a2485a44a46364cc60134360ea5af
initialize 05061d2acea19d2d9a25ac3360f70e04
initialize 665547b1344fe40de5b2c7ace4204783
initialize 8629822576debc2bfbeffa56787ca348
initialize e8186b9b68e18a82f331d51a7b8c8c15
initialize 5a456c131b0c5897804a4af8e77df5aa

Next, we implement a classify() generator function, which labels a job based on certain conditions:

[7]:
def classify(job):
    yield "init"
    if "V" in job.document and job.isfile("V.txt"):
        yield "volume-computed"

Our classifier will always yield the init label, but the volume-computed label is only yielded if the result has been computed and stored both in the job document and as a text file. We can then use this function to get an overview of our project’s status.

[8]:
print(f"Status: {project}")
for job in project:
    labels = ", ".join(classify(job))
    p = round(job.sp.p, 1)
    print(job, p, labels)
Status: notebooks/projects/tutorial
22582e83c6b12336526ed304d4378ff8 1.2 init
4cf2795722061df825ec9a4d5e31e494 7.5 init, volume-computed
b45a2485a44a46364cc60134360ea5af 4.5 init
e8186b9b68e18a82f331d51a7b8c8c15 8.9 init
5a6c687f7655319db24de59a2336eff8 0.1 init, volume-computed
973e29d6a4ed6cf7329c03c77df7f645 5.0 init, volume-computed
d03270cdbbae73c8bb1d9fa0ab370264 2.6 init, volume-computed
05061d2acea19d2d9a25ac3360f70e04 5.6 init
665547b1344fe40de5b2c7ace4204783 6.7 init
8629822576debc2bfbeffa56787ca348 7.8 init
5a456c131b0c5897804a4af8e77df5aa 10.0 init, volume-computed
c0ab2e09a6f878019a6057175bf718e6 2.3 init
9110d0837ad93ff6b4013bae30091edd 3.4 init
ee617ad585a90809947709a7a45dda9a 1.0 init, volume-computed

Using only simple classification functions, we already get a very good grasp on our project’s overall status.
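
For example, the labels can be aggregated into a compact project summary with Python’s built-in collections.Counter:

from collections import Counter

# Count how often each label occurs across the project.
print(Counter(label for job in project for label in classify(job)))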

Furthermore, we can use the classification labels for controlling the execution of operations:

[9]:
for job in project:
    labels = list(classify(job))
    if "volume-computed" not in labels:
        compute_volume(job)
Computing volume of 22582e83c6b12336526ed304d4378ff8
Computing volume of b45a2485a44a46364cc60134360ea5af
Computing volume of e8186b9b68e18a82f331d51a7b8c8c15
Computing volume of 05061d2acea19d2d9a25ac3360f70e04
Computing volume of 665547b1344fe40de5b2c7ace4204783
Computing volume of 8629822576debc2bfbeffa56787ca348
Computing volume of c0ab2e09a6f878019a6057175bf718e6
Computing volume of 9110d0837ad93ff6b4013bae30091edd

Parallelization

So far, we have executed all operations in serial using a simple for-loop. We will now learn how to easily parallelize the execution!

Instead of using a for-loop, we can also take advantage of Python’s built-in map() function:

[10]:
list(map(compute_volume, project))
print("Done.")
Computing volume of 22582e83c6b12336526ed304d4378ff8
Computing volume of 4cf2795722061df825ec9a4d5e31e494
Computing volume of b45a2485a44a46364cc60134360ea5af
Computing volume of e8186b9b68e18a82f331d51a7b8c8c15
Computing volume of 5a6c687f7655319db24de59a2336eff8
Computing volume of 973e29d6a4ed6cf7329c03c77df7f645
Computing volume of d03270cdbbae73c8bb1d9fa0ab370264
Computing volume of 05061d2acea19d2d9a25ac3360f70e04
Computing volume of 665547b1344fe40de5b2c7ace4204783
Computing volume of 8629822576debc2bfbeffa56787ca348
Computing volume of 5a456c131b0c5897804a4af8e77df5aa
Computing volume of c0ab2e09a6f878019a6057175bf718e6
Computing volume of 9110d0837ad93ff6b4013bae30091edd
Computing volume of ee617ad585a90809947709a7a45dda9a
Done.

Using the map() expression makes it trivial to implement parallelization patterns, for example using a process Pool. We provide the following snippets as illustrations; they are not executed in this notebook. For production use, we recommend the signac-flow package, which is designed for executing parallel workflows (see the sketch after these snippets).

from multiprocessing import Pool

# Execute the operation for all jobs in a pool of worker processes.
with Pool() as pool:
    pool.map(compute_volume, list(project))

Or a ThreadPool:

from multiprocessing.pool import ThreadPool

# Same pattern, but with threads instead of processes.
with ThreadPool() as pool:
    pool.map(compute_volume, list(project))
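
To illustrate the recommended approach, here is a minimal sketch of the same workflow expressed with signac-flow. It is not executed in this notebook, and decorator details may vary between signac-flow versions:

# project.py -- a hypothetical minimal FlowProject (sketch).
from flow import FlowProject


@FlowProject.post(lambda job: "V" in job.document)  # skip completed jobs
@FlowProject.operation
def compute_volume(job):
    job.document["V"] = job.sp.N * job.sp.kT / job.sp.p


if __name__ == "__main__":
    FlowProject().main()

Running python project.py run from within the project directory would then execute the operation for all jobs whose post-condition is not yet met.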

Uncomment and execute the following line if you want to remove all data and start over.

[11]:
# %rm -r projects/tutorial/workspace
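
Alternatively, the same cleanup can be performed through the signac API, for example with a sketch using job.remove():

# Irreversibly remove every job (workspace files and document).
for job in project:
    job.remove()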

In this section we learned how to create a simple yet complete workflow for our computational investigation.

In the next section we will learn how to adjust the data space, e.g., modify existing state point parameters.