1.2 Exploring Data

Please note: This notebook requires that you first run signac_101_Getting_Started.

Finding jobs

In section one of this tutorial, we evaluated the ideal gas equation and stored the results in the job document and in a file called V.txt. Let’s now have a look at how we can explore our data space for basic and advanced analysis.

We already saw how to iterate over the complete data space using the “for job in project” expression. This is a short-hand notation for “for job in project.find_jobs()”, meaning: “find all jobs”.

Instead of finding all jobs, we can also find a subset using filters.

Let’s get started by getting a handle on our project using the get_project() function. We don’t need to initialize the project again, since we already did that in section 1.

[1]:
import signac

project = signac.get_project("projects/tutorial")

Next, suppose we would like to find all jobs where p=10.0. For this, we can use the find_jobs() method, which takes a dictionary of parameters as its filter argument.

[2]:
for job in project.find_jobs({"p": 10.0}):
    print(job.statepoint())
{'p': 10.0, 'kT': 1.0, 'N': 1000}

In this case, that is of course only a single job.
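
To make the filter semantics concrete, here is a minimal pure-Python sketch of what an equality filter does conceptually: a state point matches when every key in the filter is present with an equal value. The helper matches() and the list statepoints are hypothetical stand-ins for illustration (the p=0.1 entry is assumed from section 1); they are not part of signac.

```python
# Hypothetical state points standing in for the tutorial's data space.
statepoints = [
    {"p": 0.1, "kT": 1.0, "N": 1000},
    {"p": 1.0, "kT": 1.0, "N": 1000},
    {"p": 10.0, "kT": 1.0, "N": 1000},
]


def matches(statepoint, filter_):
    """Return True if every key/value pair in filter_ appears in statepoint."""
    return all(
        key in statepoint and statepoint[key] == value
        for key, value in filter_.items()
    )


# Keep only the state points that satisfy the equality filter.
selected = [sp for sp in statepoints if matches(sp, {"p": 10.0})]
print(selected)
```

In this sketch an empty filter matches every state point, which mirrors how find_jobs() without arguments returns all jobs.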

You can execute the same kind of find operation on the command line with $ signac find, as will be shown later.

While the filtering method is optimized for a simple dissection of the data space, it is possible to construct more complex query routines, for example, using list comprehensions.

This is an example of how to select all jobs where the pressure p is greater than 0.1:

[3]:
jobs_p_gt_0_1 = [job for job in project if job.sp.p > 0.1]
for job in jobs_p_gt_0_1:
    print(job.statepoint(), job.document)
{'p': 10.0, 'kT': 1.0, 'N': 1000} {'V': 100.0}
{'p': 1.0, 'kT': 1.0, 'N': 1000} {'V': 1000.0}

Finding jobs by certain criteria requires an index of the data space, which signac automatically generates and uses internally.
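
As a purely conceptual illustration of why an index helps, the sketch below builds a small inverted index and answers an equality filter by set intersection instead of scanning every job. This is not signac's actual implementation, which is an internal detail; the job ids, the index structure, and the find() helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical job ids mapped to state points; real signac ids are hashes.
jobs = {
    "job_a": {"p": 0.1, "kT": 1.0, "N": 1000},
    "job_b": {"p": 1.0, "kT": 1.0, "N": 1000},
    "job_c": {"p": 10.0, "kT": 1.0, "N": 1000},
}

# Build the index once: (key, value) -> set of job ids with that pair.
index = defaultdict(set)
for job_id, statepoint in jobs.items():
    for key, value in statepoint.items():
        index[(key, value)].add(job_id)


def find(filter_):
    """Return ids of jobs matching all key/value pairs via set intersection."""
    result = set(jobs)
    for pair in filter_.items():
        result &= index.get(pair, set())
    return result


print(find({"p": 10.0}))
print(sorted(find({"kT": 1.0})))
```

The index is built once up front, so each lookup touches only the entries named in the filter rather than every state point.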

Views

Sometimes we want to examine our data on the file system directly. However, the file paths within the workspace are obfuscated by the job id. The solution is to use views, which are human-readable, maximally compact hierarchical links to our data space.

To create a linked view, we simply execute the create_linked_view() method within Python or the $ signac view command on the command line.

[4]:
project.create_linked_view(prefix="projects/tutorial/view")
%ls projects/tutorial/view
p/

The view paths only contain parameters which actually vary across the different jobs. In this example, that is only the pressure p.
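
The idea of keeping only the varying parameters can be sketched in plain Python. The helpers varying_keys() and view_path() below are hypothetical and the state points are assumed from section 1. The sketch only mimics the p/<value>/job layout seen in this example; the layout signac chooses for other data spaces may differ.

```python
# Hypothetical state points standing in for the tutorial's data space.
statepoints = [
    {"p": 0.1, "kT": 1.0, "N": 1000},
    {"p": 1.0, "kT": 1.0, "N": 1000},
    {"p": 10.0, "kT": 1.0, "N": 1000},
]


def varying_keys(statepoints):
    """Return the sorted keys whose values differ between state points."""
    keys = sorted({key for sp in statepoints for key in sp})
    return [
        key
        for key in keys
        if len({repr(sp.get(key)) for sp in statepoints}) > 1
    ]


def view_path(statepoint, keys):
    """Build a relative path like 'p/1.0/job' from the varying keys only."""
    parts = []
    for key in keys:
        parts.extend([key, str(statepoint[key])])
    return "/".join(parts + ["job"])


keys = varying_keys(statepoints)
for sp in statepoints:
    print(view_path(sp, keys))
```

Because kT and N are identical for all three state points, only p survives in the generated paths.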

This allows us to examine the data with highly-compact human-readable path names:

[5]:
%ls 'projects/tutorial/view/p/1.0/job/'
%cat 'projects/tutorial/view/p/1.0/job/V.txt'
V.txt  signac_job_document.json  signac_statepoint.json
1000.0

NOTE: Update your view after adding or removing jobs by executing the view command for the same prefix again!

Tip: Consider creating a linked view for large data sets on an in-memory file system for best performance!

The next section will demonstrate how to implement a basic but complete workflow for more expensive computations.