1.2 Exploring Data

Finding jobs

In section 1 of this tutorial, we evaluated the ideal gas equation and stored the results in the job document and in a file called V.txt. Let’s now have a look at how we can explore our data space for basic and advanced analysis.

We already saw how to iterate over the complete data space using the “for job in project” expression. This is a short-hand notation for “for job in project.find_jobs()”, meaning: “find all jobs”.

Instead of finding all jobs, we can also find a subset using filters.

Let’s get started by getting a handle on our project using the get_project() function. We don’t need to initialize the project again, since we already did that in section 1.

[1]:
import signac

project = signac.get_project("projects/tutorial")

Next, suppose we would like to find all jobs where p=10.0. For this, we can use the find_jobs() method, which takes a dictionary of parameters as its filter argument.

[2]:
for job in project.find_jobs({"p": 10.0}):
    print(job.statepoint())
{'p': 10.0, 'kT': 1.0, 'N': 1000}

In this case, that is of course only a single job.
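To make the filter semantics concrete, the following is a minimal sketch in plain Python of what a dictionary filter expresses: a job is selected when every key-value pair in the filter also appears in its state point. The names state_points, matches, and find are hypothetical and exist only for this illustration; plain dictionaries stand in for jobs, and this is not how signac implements find_jobs() internally.

```python
# Illustrative only: plain dictionaries stand in for signac jobs.
# The names below (state_points, matches, find) are hypothetical.
state_points = [
    {"p": 0.1, "kT": 1.0, "N": 1000},
    {"p": 1.0, "kT": 1.0, "N": 1000},
    {"p": 10.0, "kT": 1.0, "N": 1000},
]


def matches(state_point, filter_):
    """Return True if every key-value pair in filter_ appears in state_point."""
    return all(
        key in state_point and state_point[key] == value
        for key, value in filter_.items()
    )


def find(state_points, filter_=None):
    """Yield the state points that satisfy the filter; no filter selects all."""
    filter_ = filter_ or {}
    for sp in state_points:
        if matches(sp, filter_):
            yield sp


selected = list(find(state_points, {"p": 10.0}))
print(selected)
```

An empty or missing filter selects everything, which mirrors the "find all jobs" reading of find_jobs() described above.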

You can execute the same kind of find operation on the command line with $ signac find, as will be shown later.

While the filtering method is optimized for a simple dissection of the data space, it is possible to construct more complex query routines, for example using list comprehensions.

This is an example of how to select all jobs where the pressure p is greater than 0.1:

[3]:
jobs_p_gt_0_1 = [job for job in project if job.sp.p > 0.1]
for job in jobs_p_gt_0_1:
    print(job.statepoint(), job.document)
{'p': 10.0, 'kT': 1.0, 'N': 1000} {'V': 100.0}
{'p': 1.0, 'kT': 1.0, 'N': 1000} {'V': 1000.0}

Finding jobs by certain criteria requires an index of the data space. In all previous examples this index was created implicitly; however, depending on the size of the data space, it may make sense to create the index explicitly once and reuse it for multiple find operations. This is shown in the next section.

Indexing

An index is a complete record of the data and its associated metadata within our project’s data space. To generate an index for our project’s data space, use the index() method:

[4]:
for doc in project.index():
    print(doc)
{'_id': '5a6c687f7655319db24de59a2336eff8', 'statepoint': {'p': 0.1, 'kT': 1.0, 'N': 1000}, 'V': 10000.0, 'signac_id': '5a6c687f7655319db24de59a2336eff8', 'root': 'notebooks/projects/tutorial/workspace'}
{'_id': '5a456c131b0c5897804a4af8e77df5aa', 'statepoint': {'p': 10.0, 'kT': 1.0, 'N': 1000}, 'V': 100.0, 'signac_id': '5a456c131b0c5897804a4af8e77df5aa', 'root': 'notebooks/projects/tutorial/workspace'}
{'_id': 'ee617ad585a90809947709a7a45dda9a', 'statepoint': {'p': 1.0, 'kT': 1.0, 'N': 1000}, 'V': 1000.0, 'signac_id': 'ee617ad585a90809947709a7a45dda9a', 'root': 'notebooks/projects/tutorial/workspace'}

Using an index to operate on data is particularly useful in later stages of a computational investigation, where data may come from different projects and the actual storage location of files is less important.

You can store the index wherever it may be useful, e.g., in a file, in a database, or even just in a variable for repeated find operations within one script. The signac framework provides the Collection class, which can be used to manage indexes in memory and on disk.

[5]:
index = signac.Collection(project.index())

for doc in index.find({"statepoint.p": 10.0}):
    print(doc)
{'_id': '5a456c131b0c5897804a4af8e77df5aa', 'statepoint': {'p': 10.0, 'kT': 1.0, 'N': 1000}, 'V': 100.0, 'signac_id': '5a456c131b0c5897804a4af8e77df5aa', 'root': 'notebooks/projects/tutorial/workspace'}
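The query above uses the dotted key "statepoint.p" to reach into the nested statepoint dictionary of each index document. As a rough sketch of that idea, and not of the Collection implementation itself, the snippet below resolves dotted keys against plain dictionaries. The helper names resolve and find_docs are hypothetical, and the documents are trimmed copies of the index entries printed earlier.

```python
# Illustrative only: trimmed copies of the index documents shown above.
index_docs = [
    {"_id": "5a6c687f7655319db24de59a2336eff8",
     "statepoint": {"p": 0.1, "kT": 1.0, "N": 1000}, "V": 10000.0},
    {"_id": "5a456c131b0c5897804a4af8e77df5aa",
     "statepoint": {"p": 10.0, "kT": 1.0, "N": 1000}, "V": 100.0},
    {"_id": "ee617ad585a90809947709a7a45dda9a",
     "statepoint": {"p": 1.0, "kT": 1.0, "N": 1000}, "V": 1000.0},
]

_MISSING = object()  # sentinel for keys that do not exist in a document


def resolve(doc, dotted_key):
    """Follow a dotted key such as 'statepoint.p' into nested dictionaries."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return _MISSING
        value = value[part]
    return value


def find_docs(docs, query):
    """Return the documents whose resolved values equal all query values."""
    return [
        doc for doc in docs
        if all(resolve(doc, key) == expected for key, expected in query.items())
    ]


for doc in find_docs(index_docs, {"statepoint.p": 10.0}):
    print(doc["_id"], doc["V"])
```

Because the documents are held in a variable, the same list can be queried repeatedly without walking the workspace again, which is the point of creating the index explicitly.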

Views

Sometimes we want to examine our data on the file system directly. However, the file paths within the workspace are obfuscated by the job id. The solution is to use views, which are human-readable, maximally compact hierarchical links to our data space.

To create a linked view, we simply execute the create_linked_view() method within Python or the $ signac view command on the command line.

[6]:
project.create_linked_view(prefix="projects/tutorial/view")
%ls projects/tutorial/view
p/

The view paths only contain parameters which actually vary across the different jobs. In this example, that is only the pressure p.
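The idea of keeping only the varying parameters can be sketched in plain Python. The snippet below is a simplified illustration under the assumption that all state points share the same keys; the helper names varying_keys and view_path are hypothetical, and the actual path layout is decided by signac, not by this code.

```python
# Illustrative only: the three state points of the tutorial project.
state_points = [
    {"p": 0.1, "kT": 1.0, "N": 1000},
    {"p": 1.0, "kT": 1.0, "N": 1000},
    {"p": 10.0, "kT": 1.0, "N": 1000},
]


def varying_keys(state_points):
    """Return the sorted keys whose values differ between state points."""
    keys = sorted({key for sp in state_points for key in sp})
    return [
        key for key in keys
        if len({repr(sp.get(key)) for sp in state_points}) > 1
    ]


def view_path(state_point, keys):
    """Build a key/value path from the varying keys, ending in 'job'."""
    parts = []
    for key in keys:
        parts += [key, str(state_point[key])]
    parts.append("job")
    return "/".join(parts)


keys = varying_keys(state_points)
paths = [view_path(sp, keys) for sp in state_points]
print(keys)
print(paths)
```

Since kT and N are identical for every job, they would add no information to the path and are left out; only p remains.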

This allows us to examine the data with highly compact, human-readable path names:

[7]:
%ls 'projects/tutorial/view/p/1.0/job/'
%cat 'projects/tutorial/view/p/1.0/job/V.txt'
V.txt  signac_job_document.json  signac_statepoint.json
1000.0

NOTE: Update your view after adding or removing jobs by executing the view command for the same prefix again!

Tip: Consider creating a linked view for large data sets on an in-memory file system for best performance!

The next section will demonstrate how to implement a basic, but complete workflow for more expensive computations.