API Reference
This is the API for the signac (core) application.
The Project
Attributes
Project.build_job_search_index(index[, _trust]) | Build a job search index.
Project.build_job_statepoint_index([…]) | Build a state point index to identify jobs with specific parameters.
Project.check() | Check the project’s workspace for corruption.
Project.clone(job[, copytree]) | Clone job into this project.
Project.config | Get the project’s configuration.
Project.create_access_module([filename, …]) | Create the access module for indexing.
Project.create_linked_view([prefix, …]) | Create or update a persistent linked view of the selected data space.
Project.detect_schema([exclude_const, …]) | Detect the project’s state point schema.
Project.data | Get data associated with this project.
Project.doc | Get the document associated with this project.
Project.document | Get the document associated with this project.
Project.dump_statepoints(statepoints) | Dump the state points and associated job ids.
Project.export_to(target[, path, copytree]) | Export all jobs to a target location, such as a directory or a (compressed) archive file.
Project.find_job_ids([filter, doc_filter, index]) | Find the job ids of all jobs matching the filters.
Project.find_jobs([filter, doc_filter]) | Find all jobs in the project’s workspace.
Project.fn(filename) | Prepend a filename with the project’s root directory path.
Project.get_id() | Get the project identifier.
Project.get_statepoint(jobid[, fn]) | Get the state point associated with a job id.
Project.groupby([key, default]) | Group jobs according to one or more state point parameters.
Project.groupbydoc([key, default]) | Group jobs according to one or more document values.
Project.import_from([origin, schema, sync, …]) | Import the data space located at origin into this project.
Project.id | Get the project identifier.
Project.index([formats, depth, skip_errors, …]) | Generate an index of the project’s workspace.
Project.isfile(filename) | Check if a filename exists in the project’s root directory.
Project.min_len_unique_id() | Determine the minimum length required for a job id to be unique.
Project.num_jobs() | Return the number of initialized jobs.
Project.open_job([statepoint, id]) | Get a job handle associated with a state point.
Project.read_statepoints([fn]) | Read all state points from a file.
Project.repair([fn_statepoints, index, job_ids]) | Attempt to repair the workspace after it has been corrupted.
Project.reset_statepoint(job, new_statepoint) | Reset the state point of a job.
Project.root_directory() | Return the project’s root directory.
Project.stores | Get HDF5 stores associated with this project.
Project.sync(other[, strategy, exclude, …]) | Synchronize this project with the other project.
Project.update_cache() | Update the persistent state point cache.
Project.update_statepoint(job, update[, …]) | Update the state point of a job.
Project.workspace() | Return the project’s workspace directory.
Project.write_statepoints([statepoints, fn, …]) | Dump state points to a file.
class signac.Project(config=None, _ignore_schema_version=False)
Bases: object
The handle on a signac project.
Application developers should usually not need to instantiate this class directly; use get_project() instead.
Parameters:
- config – The project configuration to use. By default, it loads the first signac project configuration found while searching upward from the current working directory (Default value = None).
- _ignore_schema_version (bool) – (Default value = False).
FN_CACHE = '.signac_sp_cache.json.gz'
The default filename for the state point cache file.

FN_DOCUMENT = 'signac_project_document.json'
The project’s document filename.

FN_STATEPOINTS = 'signac_statepoints.json'
The default filename to read state points from and write them to.

KEY_DATA = 'signac_data'
The project’s datastore key.
build_job_search_index(index, _trust=False)
Build a job search index.
Parameters: - index (list) – A document index.
- _trust – (Default value = False).
Returns: A job search index based on the provided index.
Return type: JobSearchIndex
Deprecated since version 1.3: This will be removed in 2.0.
build_job_statepoint_index(exclude_const=False, index=None)
Build a state point index to identify jobs with specific parameters.
This method generates pairs of state point keys and mappings of values to a set of all corresponding job ids. The pairs are ordered by the number of different values. Since state point keys may be nested, they are represented as a tuple. For example:
>>> for i in range(4):
...     project.open_job({'a': i, 'b': {'c': i % 2}}).init()
...
>>> for key, value in project.build_job_statepoint_index():
...     print(key)
...     pprint.pprint(value)
...
('b', 'c')
defaultdict(<class 'set'>, {0: {'3a530c13bfaf57517b4e81ecab6aec7f', '4e9a45a922eae6bb5d144b36d82526e4'}, 1: {'d49c6609da84251ab096654971115d0c', '5c2658722218d48a5eb1e0ef7c26240b'}})
('a',)
defaultdict(<class 'set'>, {0: {'4e9a45a922eae6bb5d144b36d82526e4'}, 1: {'d49c6609da84251ab096654971115d0c'}, 2: {'3a530c13bfaf57517b4e81ecab6aec7f'}, 3: {'5c2658722218d48a5eb1e0ef7c26240b'}})
Values that are constant over the complete data space can be optionally ignored with the exclude_const argument set to True.
Parameters: - exclude_const (bool) – Exclude entries that are shared by all jobs that are part of the index (Default value = False).
- index – A document index (Default value = None).
Yields: tuple – Pairs of state point keys and mappings of values to a set of all corresponding job ids.
Deprecated since version 1.3: This will be removed in 2.0. Use the detect_schema() function instead.
check()
Check the project’s workspace for corruption.
Raises: JobsCorruptedError – When one or more jobs are identified as corrupted.
clone(job, copytree=<function copytree>)
Clone job into this project.
Create an identical copy of job within this project.
See signac clone for the command line equivalent.
Parameters:
- job (Job) – The job to copy into this project.
- copytree – (Default value = syncutil.copytree)
Returns: The job instance corresponding to the copied job.
Return type: Job
Raises: DestinationExistsError – If a job with the same id is already initialized within this project.
config
Get the project’s configuration.
Returns: Dictionary containing the project’s configuration.
Return type: _ProjectConfig
create_access_module(filename=None, main=True, master=None)
Create the access module for indexing.
This method generates the access module required to make this project’s index part of a main index.
Returns: Access module name.
Return type: str
Deprecated since version 1.5: This will be removed in 2.0. Access modules are deprecated.
create_linked_view(prefix=None, job_ids=None, index=None, path=None)
Create or update a persistent linked view of the selected data space.
Similar to export_to(), this function expands the data space for the selected jobs, but instead of copying data it creates symbolic links to the individual job workspace directories. This is primarily useful for browsing the data space with a file browser using human-interpretable directory paths.
By default, the paths of the view are based on the variable state point keys that form the implicit schema of the selected jobs. For example, creating a linked view for a data space with schema
>>> print(project.detect_schema())
{
 'foo': 'int([0, 1, 2, ..., 8, 9], 10)',
}
by calling project.create_linked_view('my_view') will produce paths similar to:
my_view/foo/0/job -> workspace/b8fcc6b8f99c56509eb65568922e88b8
my_view/foo/1/job -> workspace/b6cd26b873ae3624653c9268deff4485
...
It is possible to control the paths using the path argument, which behaves in the exact same manner as the equivalent argument for export_to().
Note: The behavior of this function is almost equivalent to project.export_to('my_view', copytree=os.symlink), with the major difference that view hierarchies are actually updated, meaning that invalid links are automatically removed.
See signac view for the command line equivalent.
Parameters: - prefix (str) – The path where the linked view will be created or updated (Default value = None).
- job_ids (iterable) – If None (the default), create the view for the complete data space, otherwise only for this iterable of job ids.
- index – A document index (Default value = None).
- path – The path (function) used to structure the linked data space (Default value = None).
Returns: A dictionary that maps the source directory paths to the linked directory paths.
Return type: dict
data
Get data associated with this project.
This property should be used for large array-like data, which can’t be stored efficiently in the project document. For examples and usage, see Centralized Project Data.
Equivalent to:
return project.stores['signac_data']
See also: H5Store – Usage examples.
Returns: An HDF5-backed datastore.
Return type: H5Store
detect_schema(exclude_const=False, subset=None, index=None)
Detect the project’s state point schema.
See signac schema for the command line equivalent.
Parameters: - exclude_const (bool) – Exclude all state point keys that are shared by all jobs within this project (Default value = False).
- subset – A sequence of jobs or job ids specifying a subset over which the state point schema should be detected (Default value = None).
- index – A document index (Default value = None).
Returns: The detected project schema.
Return type: ProjectSchema
doc
Get the document associated with this project.
Alias for document().
Returns: The project document.
Return type: JSONDict
document
Get the document associated with this project.
Returns: The project document.
Return type: JSONDict
dump_statepoints(statepoints)
Dump the state points and associated job ids.
Equivalent to:
{project.open_job(sp).id: sp for sp in statepoints}
Parameters: statepoints (iterable) – A list of state points.
Returns: A mapping where the key is the job id and the value is the state point.
Return type: dict
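The mapping above relies on each state point having a stable, unique id. The idea can be sketched in plain Python with the standard library; the hash construction below is a hypothetical stand-in for illustration, not signac’s actual implementation:

```python
import hashlib
import json

def statepoint_id(statepoint):
    # Serialize with sorted keys so logically equal state points
    # always produce the same id (illustrative stand-in for job ids).
    blob = json.dumps(statepoint, sort_keys=True).encode()
    return hashlib.md5(blob).hexdigest()

statepoints = [{'a': i, 'b': {'c': i % 2}} for i in range(4)]
# Analogous to {project.open_job(sp).id: sp for sp in statepoints}.
mapping = {statepoint_id(sp): sp for sp in statepoints}
print(len(mapping))  # 4 distinct ids
```

Because the serialization sorts keys, two dicts with the same contents in different insertion order map to the same id.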
export_to(target, path=None, copytree=None)
Export all jobs to a target location, such as a directory or a (compressed) archive file.
Use this function in combination with find_jobs() to export only a selection of jobs, for example:
project.find_jobs({'foo': 0}).export_to('foo_0.tar')
The path argument enables users to control how exactly the exported data space is expanded. By default, the path function will be based on the implicit schema of the exported jobs. For example, exporting jobs that all differ by a state point key foo with project.export_to('data/') could produce a directory structure like this:
data/foo/0
data/foo/1
...
That would be equivalent to specifying path=lambda job: os.path.join('foo', job.sp.foo).
Instead of a function, we can also provide a string, where fields for state point keys are automatically formatted. For example, the following two path arguments are equivalent: "foo/{foo}" and "foo/{job.sp.foo}".
Any attribute of job can be used as a field here, so job.doc.bar, job._id, and job.ws can also be used as path fields.
A special {{auto}} field allows us to expand the path automatically with state point keys that have not been specified explicitly. For example, one can provide path="foo/{foo}/{{auto}}" to specify that the path shall begin with foo/{foo}/ and is then automatically expanded with all other state point key-value pairs. How key-value pairs are concatenated can be controlled via the format specifier; for example, path="{{auto:_}}" will generate a structure such as:
data/foo_0
data/foo_1
...
Finally, providing path=False is equivalent to path="{job._id}".
See also: import_from() – Previously exported or non-signac data spaces can be imported.
See signac export for the command line equivalent.
Parameters:
- target – A path to a directory to export to. The target cannot already exist. Besides directories, possible targets are tar files (.tar), gzipped tar files (.tar.gz), zip files (.zip), bzip2-compressed files (.bz2), and xz-compressed files (.xz).
- path – The path (function) used to structure the exported data space. This argument must either be a callable which returns a path (str) as a function of job, a string where fields are replaced using the job state point dictionary, or False, which means the job id is used as the path. Defaults to the equivalent of {{auto}}.
- copytree – The function used for the actual copying of directory tree structures. Defaults to shutil.copytree(). Can only be used when the target is a directory.
Returns: A dict that maps the source directory paths to the target directory paths.
Return type: dict
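The string form of the path argument can be sketched with plain str.format over a state point dictionary. The dicts below are hypothetical stand-ins for jobs (real jobs expose job.sp rather than a bare dict), and auto_path is a crude, illustrative approximation of {{auto}} expansion, not signac’s implementation:

```python
# Hypothetical state points standing in for jobs.
statepoints = [{'foo': 0, 'bar': 'x'}, {'foo': 1, 'bar': 'y'}]

# "foo/{foo}" formats the value of the state point key 'foo';
# unused keys in the mapping are simply ignored by str.format.
path_template = 'foo/{foo}'
paths = [path_template.format(**sp) for sp in statepoints]
print(paths)  # ['foo/0', 'foo/1']

def auto_path(sp, explicit=('foo',)):
    # Sketch of {{auto}}: expand the path with all key-value pairs
    # that were not named explicitly in the template.
    head = 'foo/{}'.format(sp['foo'])
    rest = '/'.join(
        '{}/{}'.format(k, v) for k, v in sorted(sp.items()) if k not in explicit
    )
    return head + '/' + rest if rest else head

print(auto_path({'foo': 0, 'bar': 'x'}))  # foo/0/bar/x
```

Sorting the remaining items gives a deterministic path regardless of dict insertion order.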
find_job_ids(filter=None, doc_filter=None, index=None)
Find the job ids of all jobs matching the filters.
The optional filter arguments must be a Mapping of key-value pairs and JSON serializable.
Note
Providing a pre-calculated index may vastly increase the performance of this function.
Parameters: - filter (dict) – A mapping of key-value pairs that all indexed job state points are compared against (Default value = None).
- doc_filter (dict) – A mapping of key-value pairs that all indexed job documents are compared against (Default value = None).
- index – A document index. If not provided, an index will be computed (Default value = None).
Returns: The ids of all indexed jobs matching both filters.
Raises:
- TypeError – If the filters are not JSON serializable.
- ValueError – If the filters are invalid.
- RuntimeError – If the filters are not supported by the index.
Deprecated since version 1.3: This will be removed in 2.0. Use find_jobs() instead, then access ids with job.id. Replicate the original behavior with [job.id for job in project.find_jobs()].
find_jobs(filter=None, doc_filter=None)
Find all jobs in the project’s workspace.
The optional filter arguments must be a Mapping of key-value pairs and JSON serializable. The filter argument is used to search against job state points, whereas the doc_filter argument compares against job document keys.
See signac find for the command line equivalent.
Parameters: - filter (Mapping) – A mapping of key-value pairs that all indexed job state points are compared against (Default value = None).
- doc_filter (Mapping) – A mapping of key-value pairs that all indexed job documents are compared against (Default value = None).
Returns: JobsCursor of jobs matching the provided filter(s).
Return type: JobsCursor
Raises:
- TypeError – If the filters are not JSON serializable.
- ValueError – If the filters are invalid.
- RuntimeError – If the filters are not supported by the index.
fn(filename)
Prepend a filename with the project’s root directory path.
Parameters: filename (str) – The name of the file.
Returns: The joined path of the project root directory and filename.
Return type: str
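The behavior of fn() amounts to a join against the project root. A minimal stand-in, assuming a hypothetical root path:

```python
import os

root = '/tmp/my_project'  # hypothetical project root directory

def fn(filename):
    # Equivalent in spirit to Project.fn(): prepend the root directory.
    return os.path.join(root, filename)

print(fn('config.json'))
```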
get_id()
Get the project identifier.
Returns: The project id.
Return type: str
Deprecated since version 1.3: This will be removed in 2.0. Use project.id instead.
classmethod get_job(root=None)
Find a Job in or above the current working directory (or provided path).
Parameters: root (str) – The job root directory. If no root directory is given, the current working directory is assumed to be the job directory (Default value = None).
Returns: The job instance.
Return type: Job
Raises: LookupError – When the job cannot be found.
classmethod get_project(root=None, search=True, **kwargs)
Find a project configuration and return the associated project.
Parameters: - root (str) – The starting point to search for a project, defaults to the current working directory.
- search (bool) – If True, search for project configurations inside and above the specified root directory, otherwise only return projects with a root directory identical to the specified root argument (Default value = True).
- **kwargs – Optional keyword arguments that are forwarded to the Project class constructor.
Returns: An instance of Project.
Return type: Project
Raises: LookupError – When the project configuration cannot be found.
get_statepoint(jobid, fn=None)
Get the state point associated with a job id.
The state point is retrieved from the internal cache, from the workspace or from a state points file.
Parameters: - jobid (str) – A job id to get the state point for.
- fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.
Returns: The state point corresponding to jobid.
Return type: dict
Raises:
- KeyError – If the state point associated with jobid could not be found.
- JobsCorruptedError – If the state point manifest file corresponding to jobid is inaccessible or corrupted.
Deprecated since version 1.3: This will be removed in 2.0. Use open_job(id=jobid).statepoint() instead.
groupby(key=None, default=None)
Group jobs according to one or more state point parameters.
This method can be called on any JobsCursor, such as the one returned by find_jobs() or by iterating over a project.
Examples
# Group jobs by state point parameter 'a'.
for key, group in project.groupby('a'):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.sp['b'] and job.sp['c'].
for key, group in project.find_jobs({'a': 1}).groupby(('b', 'c')):
    print(key, list(group))

# Group by job.sp['d'] and job.document['count'] using a lambda.
for key, group in project.groupby(
    lambda job: (job.sp['d'], job.document['count'])
):
    print(key, list(group))
If key is None, jobs are grouped by identity (by id), placing one job into each group.
Parameters: - key (str, iterable, or callable) – The state point grouping parameter(s) passed as a string, iterable of strings, or a callable that will be passed one argument, the job (Default value = None).
- default – A default value to be used when a given state point key is not present. The value must be sortable and is only used if not None (Default value = None).
Returns: - key (str) – Grouped key.
- group (iterable of Jobs) – Iterable of Job instances matching this group key.
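The grouping semantics can be sketched with itertools.groupby over sorted state points. Plain dicts stand in for jobs here, and by_key is a hypothetical helper mirroring the string form of the key argument; signac’s own implementation may differ:

```python
from itertools import groupby

# Plain dicts standing in for job state points.
jobs = [{'a': 1, 'b': 0}, {'a': 0, 'b': 1}, {'a': 1, 'b': 2}]

def by_key(key, default=None):
    # Build a key function like groupby's string form,
    # falling back to default when the key is missing.
    return lambda job: job.get(key, default)

keyfunc = by_key('a')
# itertools.groupby only groups adjacent items, so sort first.
for value, group in groupby(sorted(jobs, key=keyfunc), key=keyfunc):
    print(value, list(group))
```

Sorting before grouping is essential: itertools.groupby only merges runs of adjacent equal keys.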
groupbydoc(key=None, default=None)
Group jobs according to one or more document values.
This method can be called on any JobsCursor, such as the one returned by find_jobs() or by iterating over a project.
Examples
# Group jobs by document value 'a'.
for key, group in project.groupbydoc('a'):
    print(key, list(group))

# Find jobs where job.sp['a'] is 1 and group them
# by job.document['b'] and job.document['c'].
for key, group in project.find_jobs({'a': 1}).groupbydoc(('b', 'c')):
    print(key, list(group))

# Group by whether 'd' is a field in the job.document using a lambda.
for key, group in project.groupbydoc(lambda doc: 'd' in doc):
    print(key, list(group))
If key is None, jobs are grouped by identity (by id), placing one job into each group.
Parameters: - key (str, iterable, or function) – The state point grouping parameter(s) passed as a string, iterable of strings,
or a function that will be passed one argument,
document()
. (Default value = None). - default – A default value to be used when a given state point key is not present (must be sortable).
- key (str, iterable, or function) – The state point grouping parameter(s) passed as a string, iterable of strings,
or a function that will be passed one argument,
import_from(origin=None, schema=None, sync=None, copytree=None)
Import the data space located at origin into this project.
This function will walk through the data space located at origin and will try to identify data space paths that can be imported as a job workspace into this project.
The schema argument expects a function that takes a path argument and returns a state point dictionary. A default function is used when no argument is provided. The default schema function will simply look for state point manifest files (usually named signac_statepoint.json) and then import all data located within that path into the job workspace corresponding to the state point specified in the manifest file.
Alternatively, the schema argument may be a string that is converted into a schema function. For example, providing foo/{foo:int} as the schema argument means that all directories under foo/ will be imported and their names will be interpreted as the value for foo within the state point.
Tip: Use copytree=os.replace or copytree=shutil.move to move data spaces on import instead of copying them.
Warning: Imports can fail due to conflicts. Moving data instead of copying may therefore lead to inconsistent states; users are advised to apply caution.
See also: export_to() – Export the project data space.
See signac import for the command line equivalent.
Parameters: - origin – The path to the data space origin, which is to be imported. This may be a path to a directory, a zip file, or a tarball archive (Default value = None).
- schema – An optional schema function, which is either a string or a function that accepts a path as its first and only argument and returns the corresponding state point as dict. (Default value = None).
- sync – If True, the project will be synchronized with the imported data space. If a dict of keyword arguments is provided, the arguments will be used for sync() (Default value = None).
- copytree – Specify which function to use for the actual copytree operation. Defaults to shutil.copytree().
Returns: A dict that maps the source directory paths to the target directory paths.
Return type: dict
index(formats=None, depth=0, skip_errors=False, include_job_document=True)
Generate an index of the project’s workspace.
This generator function indexes every file in the project’s workspace up to the specified depth. The job document, if it exists, is always indexed; other files need to be specified with the formats argument.
See signac project -i for the command line equivalent.
for doc in project.index({r'.*\.txt': 'TextFile'}):
    print(doc)
Parameters:
- formats (str, dict) – The format definitions as a pattern string (e.g. r'.*\.txt') or a mapping from pattern strings to formats (e.g. 'TextFile'). If None, only the job document is indexed (Default value = None).
- depth (int) – Specifies the crawling depth. A value of 0 means no limit (Default value = 0).
- skip_errors (bool) – Skip all errors which occur during indexing. This is useful when trying to repair a broken workspace (Default value = False).
- include_job_document (bool) – Include the contents of job documents (Default value = True).
Yields: dict – Index document.
classmethod init_project(name, root=None, workspace=None, make_dir=True)
Initialize a project with the given name.
It is safe to call this function multiple times with the same arguments. However, a RuntimeError is raised if an existing project configuration would conflict with the provided initialization parameters.
See signac init for the command line equivalent.
Parameters: - name (str) – The name of the project to initialize.
- root (str) – The root directory for the project. Defaults to the current working directory.
- workspace (str) – The workspace directory for the project. Defaults to a subdirectory workspace in the project root.
- make_dir (bool) – Create the project root directory if it does not exist yet (Default value = True).
Returns: Initialized project, an instance of Project.
Return type: Project
Raises: RuntimeError – If the project root path already contains a conflicting project configuration.
isfile(filename)
Check if a filename exists in the project’s root directory.
Parameters: filename (str) – The name of the file.
Returns: True if filename exists in the project’s root directory.
Return type: bool
min_len_unique_id()
Determine the minimum length required for a job id to be unique.
This method’s runtime scales with the number of jobs in the workspace.
Returns: Minimum string length of a unique job identifier. Return type: int
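The computation can be sketched as finding the smallest prefix length at which all job ids remain distinct. This is an illustrative version, not the library’s actual code:

```python
def min_len_unique_id(ids):
    # Smallest prefix length at which all ids remain distinct.
    longest = max(len(i) for i in ids)
    for n in range(1, longest + 1):
        prefixes = {i[:n] for i in ids}
        if len(prefixes) == len(ids):
            return n
    return longest

# Hypothetical job ids (abbreviated).
ids = ['3a530c13', '3a9f0e21', '4e9a45a9']
print(min_len_unique_id(ids))  # 3
```

The scan over all ids at each candidate length is what makes the real method’s runtime scale with the number of jobs.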
num_jobs()
Return the number of initialized jobs.
Returns: Count of initialized jobs.
Return type: int
open_job(statepoint=None, id=None)
Get a job handle associated with a state point.
This method returns the job instance associated with the given state point or job id. Opening a job by a valid state point never fails. Opening a job by id requires a lookup of the state point from the job id, which may fail if the job was not previously initialized.
Returns: The job instance.
Return type: Job
Raises:
- KeyError – If the attempt to open the job by id fails.
- LookupError – If the attempt to open the job by an abbreviated id returns more than one match.
read_statepoints(fn=None)
Read all state points from a file.
See also:
- dump_statepoints() – Dump the state points and associated job ids.
- write_statepoints() – Dump state points to a file.
Parameters: fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.
Returns: State points.
Return type: dict
repair(fn_statepoints=None, index=None, job_ids=None)
Attempt to repair the workspace after it has been corrupted.
This method will attempt to repair lost or corrupted job state point manifest files using a state points file or a document index or both.
Parameters:
- fn_statepoints (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.
- index – A document index (Default value = None).
- job_ids – An iterable of job ids that should get repaired. Defaults to all jobs.
Raises: JobsCorruptedError – When one or more corrupted jobs could not be repaired.
reset_statepoint(job, new_statepoint)
Reset the state point of a job.
Danger: Use this function with caution! Resetting a job’s state point may sometimes be necessary, but it can lead to incoherent data spaces.
Parameters:
- job (Job) – The job that should be reset to a new state point.
- new_statepoint (mapping) – The job’s new state point.
Raises:
- DestinationExistsError – If a job associated with the new state point is already initialized.
- OSError – If the move failed due to an unknown system-related error.
Deprecated since version 1.3: This will be removed in 2.0. Use job.reset_statepoint() instead.
root_directory()
Return the project’s root directory.
Returns: Path of the project directory.
Return type: str
stores
Get HDF5 stores associated with this project.
Use this property to access an HDF5 file within the project’s root directory using the H5Store dict-like interface.
This is an example for accessing an HDF5 file called 'my_data.h5' within the project’s root directory:
project.stores['my_data']['array'] = np.random.random((32, 4))
This is equivalent to:
H5Store(project.fn('my_data.h5'))['array'] = np.random.random((32, 4))
Both project.stores and the H5Store itself support attribute access. The above example could therefore also be expressed as:
project.stores.my_data.array = np.random.random((32, 4))
Returns: The HDF5 store manager for this project.
Return type: H5StoreManager
sync(other, strategy=None, exclude=None, doc_sync=None, selection=None, **kwargs)
Synchronize this project with the other project.
Try to clone all jobs from the other project to this project. If a job is already part of this project, try to synchronize the job using the optionally specified strategies.
See signac sync for the command line equivalent.
Parameters:
- other (Project) – The other project to synchronize this project with.
- strategy – A file synchronization strategy (Default value = None).
- exclude – Files with names matching the given pattern will be excluded from the synchronization (Default value = None).
- doc_sync – The function applied for synchronizing documents (Default value = None).
- selection – Only sync the given jobs (Default value = None).
- **kwargs – This method also accepts the same keyword arguments as the sync_projects() function.
Raises:
- DocumentSyncConflict – If there are conflicting keys within the project or job documents that cannot be resolved with the given strategy or if no strategy is provided.
- FileSyncConflict – If there are differing files that cannot be resolved with the given strategy or if no strategy is provided.
- SchemaSyncConflict – If the check_schema argument is True and the detected state point schemas of this and the other project differ.
temporary_project(name=None, dir=None)
Context manager for the initialization of a temporary project.
The temporary project is by default created within the root project’s workspace to ensure that they share the same file system. This is an example for how this method can be used for the import and synchronization of external data spaces.
with project.temporary_project() as tmp_project:
    tmp_project.import_from('/data')
    project.sync(tmp_project)
Returns: An instance of Project.
Return type: Project
to_dataframe(*args, **kwargs)
Export the project metadata to a pandas DataFrame.
The arguments to this function are forwarded to to_dataframe().
Parameters: *args, **kwargs – Forwarded arguments.
Returns: The project metadata as a pandas DataFrame.
Return type: DataFrame
update_cache()
Update the persistent state point cache.
This function updates a persistent state point cache, which is stored in the project root directory. Most data space operations, including iteration and filtering or selection are expected to be significantly faster after calling this function, especially for large data spaces.
update_statepoint(job, update, overwrite=False)
Update the state point of a job.
Warning: While appending to a job’s state point is generally safe, modifying existing parameters may lead to data inconsistency. Use the overwrite argument with caution!
Parameters:
- job (Job) – The job whose state point shall be updated.
- update (mapping) – A mapping used for the state point update.
- overwrite – Set to True to ignore whether this update overwrites parameters which are currently part of the job’s state point. Use with caution! (Default value = False).
Raises:
- KeyError – If the update contains keys which are already part of the job’s state point and overwrite is False.
- DestinationExistsError – If a job associated with the new state point is already initialized.
- OSError – If the move failed due to an unknown system-related error.
Deprecated since version 1.3: This will be removed in 2.0. Use job.update_statepoint() instead.
workspace()
Return the project’s workspace directory.
The workspace defaults to project_root/workspace. Configure this directory with the ‘workspace_dir’ configuration attribute. If the specified directory is a relative path, it is resolved relative to the project’s root directory.
Note: The configuration will respect environment variables, such as $HOME.
See signac project -w for the command line equivalent.
Returns: Path of the workspace directory.
Return type: str
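Resolving the configured workspace directory can be sketched as follows. The root and configuration values below are hypothetical, and this is a simplified illustration of the documented behavior (environment-variable expansion, then relative-path resolution against the root), not signac’s internals:

```python
import os

def workspace(root, configured):
    # Respect environment variables such as $HOME, then resolve
    # relative paths against the project's root directory.
    expanded = os.path.expandvars(configured)
    if os.path.isabs(expanded):
        return expanded
    return os.path.join(root, expanded)

root = '/tmp/my_project'     # hypothetical project root
print(workspace(root, 'workspace'))
print(workspace(root, '/data/ws'))
```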
write_statepoints(statepoints=None, fn=None, indent=2)
Dump state points to a file.
If the file already contains state points, all new state points will be appended, while the old ones are preserved.
See also: dump_statepoints() – Dump the state points and associated job ids.
Parameters:
- statepoints (iterable) – A list of state points, defaults to all state points which are defined in the workspace.
- fn (str) – The filename of the file containing the state points, defaults to FN_STATEPOINTS.
- indent (int) – Specify the indentation of the JSON file (Default value = 2).
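The append-while-preserving behavior can be sketched with the json module. The helper below is a simplified, hypothetical stand-in that keys entries by their serialized state point (signac keys by job id); it only illustrates that existing entries survive repeated writes:

```python
import json
import os
import tempfile

def write_statepoints(fn, statepoints, indent=2):
    # Merge new state points into the file, preserving existing entries.
    existing = {}
    if os.path.isfile(fn):
        with open(fn) as f:
            existing = json.load(f)
    for sp in statepoints:
        # Illustrative key; signac uses the job id here.
        existing[json.dumps(sp, sort_keys=True)] = sp
    with open(fn, 'w') as f:
        json.dump(existing, f, indent=indent)

tmp = os.path.join(tempfile.mkdtemp(), 'signac_statepoints.json')
write_statepoints(tmp, [{'a': 0}])
write_statepoints(tmp, [{'a': 1}])
with open(tmp) as f:
    print(len(json.load(f)))  # 2: the old entry was preserved
```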
The Job class
Attributes
Job.clear() | Remove all job data, but not the job itself.
Job.close() | Close the job and switch to the previous working directory.
Job.data | Get data associated with this job.
Job.doc | Alias for document.
Job.document | Get document associated with this job.
Job.fn(filename) | Prepend a filename with the job’s workspace directory path.
Job.get_id() | Job’s state point unique identifier.
Job.id | Get the unique identifier for the job’s state point.
Job.init([force]) | Initialize the job’s workspace directory.
Job.isfile(filename) | Return True if file exists in the job’s workspace.
Job.move(project) | Move this job to project.
Job.open() | Enter the job’s workspace directory.
Job.remove() | Remove the job’s workspace including the job document.
Job.reset() | Remove all job data, but not the job itself.
Job.reset_statepoint(new_statepoint) | Reset the state point of this job.
Job.sp | Alias for statepoint.
Job.statepoint | Get the job’s state point.
Job.stores | Get HDF5 stores associated with this job.
Job.sync(other[, strategy, exclude, doc_sync]) | Perform a one-way synchronization of this job with the other job.
Job.update_statepoint(update[, overwrite]) | Update the state point of this job.
Job.workspace() | Return the job’s unique workspace directory.
Job.ws | Alias for workspace().
-
class
signac.contrib.job.
Job
(project, statepoint, _id=None)¶ Bases:
object
The job instance is a handle to the data of a unique state point.
Application developers should usually not need to directly instantiate this class, but use
open_job()
instead.
Parameters: -
FN_DOCUMENT
= 'signac_job_document.json'¶ The job’s document filename.
-
FN_MANIFEST
= 'signac_statepoint.json'¶ The job’s manifest filename.
The job manifest is a human-readable dump of the job’s state point, stored in each workspace directory.
-
KEY_DATA
= 'signac_data'¶ The job’s datastore key.
-
clear
()¶ Remove all job data, but not the job itself.
This function will do nothing if the job was not previously initialized.
See signac rm -c for the command line equivalent.
-
close
()¶ Close the job and switch to the previous working directory.
-
data
¶ Get data associated with this job.
This property should be used for large array-like data, which can’t be stored efficiently in the job document. For examples and usage, see Job Data Storage.
Equivalent to:
return job.stores['signac_data']
Returns: An HDF5-backed datastore. Return type: H5Store
-
doc
¶ Alias for document.
Warning
If you need a deep copy that will not modify the underlying persistent JSON file, use document instead of doc. For more information, see statepoint or JSONDict.
-
document
¶ Get document associated with this job.
Warning
If you need a deep copy that will not modify the underlying persistent JSON file, use document instead of doc. For more information, see statepoint or JSONDict.
See signac document for the command line equivalent.
Returns: The job document handle. Return type: JSONDict
-
fn
(filename)¶ Prepend a filename with the job’s workspace directory path.
Parameters: filename (str) – The name of the file. Returns: The full workspace path of the file. Return type: str
-
get_id
()¶ Job’s state point unique identifier.
Returns: The job id. Return type: str Deprecated since version 1.3: This will be removed in 2.0. Use job.id instead.
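Conceptually, a job id is a short hash digest of the job’s state point. The exact serialization signac uses is an internal detail; the sorted-key JSON encoding below is an assumption for illustration:

```python
import hashlib
import json

def calc_id(statepoint):
    """Sketch: derive a stable id by hashing a canonical encoding."""
    # Sorting keys makes the encoding independent of key insertion order.
    blob = json.dumps(statepoint, sort_keys=True)
    return hashlib.md5(blob.encode()).hexdigest()

id_a = calc_id({'a': 1, 'b': 2})
id_b = calc_id({'b': 2, 'a': 1})  # same state point, different key order
```

Equal state points yield equal ids regardless of key order, which is what makes the id usable as a workspace directory name.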
-
init
(force=False)¶ Initialize the job’s workspace directory.
This function will do nothing if the directory and the job manifest already exist.
Returns the calling job.
See signac job -c for the command line equivalent.
Parameters: force (bool) – Overwrite any existing state point’s manifest files, e.g., to repair them if they got corrupted (Default value = False). Returns: The job handle. Return type: Job
-
isfile
(filename)¶ Return True if file exists in the job’s workspace.
Parameters: filename (str) – The name of the file. Returns: True if file with filename exists in workspace. Return type: bool
-
move
(project)¶ Move this job to project.
This function will attempt to move this job instance from its original project to a different project.
See signac move for the command line equivalent.
Parameters: project ( Project
) – The project to move this job to.
-
open
()¶ Enter the job’s workspace directory.
You can use the Job class as context manager:
with project.open_job(my_statepoint) as job:
    # manipulate your job data
Opening the context will switch into the job’s workspace, leaving it will switch back to the previous working directory.
-
remove
()¶ Remove the job’s workspace including the job document.
This function will do nothing if the workspace directory does not exist.
See signac rm for the command line equivalent.
-
reset
()¶ Remove all job data, but not the job itself.
This function will initialize the job if it was not previously initialized.
-
reset_statepoint
(new_statepoint)¶ Reset the state point of this job.
Danger
Use this function with caution! Resetting a job’s state point may sometimes be necessary, but can possibly lead to incoherent data spaces.
Parameters: new_statepoint (dict) – The job’s new state point.
-
sp
¶ Alias for
statepoint
.
-
statepoint
¶ Get the job’s state point.
Warning
The state point object behaves like a dictionary in most cases, but because it persists changes to the filesystem, making a copy requires explicitly converting it to a dict. If you need a modifiable copy that will not modify the underlying JSON file, you can access a dict copy of the state point by calling it, e.g. sp_dict = job.statepoint() instead of sp = job.statepoint. For more information, see JSONDict.
See signac statepoint for the command line equivalent.
Returns: Returns the job’s state point. Return type: dict
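The distinction drawn above between the live state point object and a detached dict copy can be sketched with a minimal stand-in class (hypothetical, not signac’s implementation):

```python
import copy

class StatePointSketch(dict):
    """A dict that, like the state point, returns a detached copy when called."""
    def __call__(self):
        # Calling the object yields a plain dict, decoupled from this instance.
        return copy.deepcopy(dict(self))

sp = StatePointSketch({'kT': 1.0})
sp_dict = sp()          # plain, modifiable dict copy
sp_dict['kT'] = 2.0     # does not affect sp
```

Mutating the detached copy leaves the original object (and, in signac, the underlying JSON file) untouched.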
-
stores
¶ Get HDF5 stores associated with this job.
Use this property to access an HDF5 file within the job’s workspace directory using the H5Store dict-like interface.
This is an example for accessing an HDF5 file called ‘my_data.h5’ within the job’s workspace:
job.stores['my_data']['array'] = np.random.rand(32, 4)
This is equivalent to:
H5Store(job.fn('my_data.h5'))['array'] = np.random.rand(32, 4)
Both stores and the H5Store itself support attribute access. The above example could therefore also be expressed as:
job.stores.my_data.array = np.random.rand(32, 4)
Returns: The HDF5-Store manager for this job. Return type: H5StoreManager
-
sync
(other, strategy=None, exclude=None, doc_sync=None, **kwargs)¶ Perform a one-way synchronization of this job with the other job.
By default, this method will synchronize all files and document data from the other job to this job until a synchronization conflict occurs. There are two different kinds of synchronization conflicts:
- The two jobs have files with the same name, but different content.
- The two jobs have documents that share keys, but those keys are associated with different values.
A file conflict can be resolved by providing a ‘FileSync’ strategy or by excluding files from the synchronization. An unresolvable conflict is indicated by raising a FileSyncConflict exception.
A document synchronization conflict can be resolved by providing a doc_sync function that takes the source and the destination document as first and second argument.
Parameters: - other (Job) – The other job to synchronize from.
- strategy – A synchronization strategy for file conflicts. If no strategy is provided, a SyncConflict exception will be raised upon conflict (Default value = None).
- exclude (str) – A filename exclude pattern. All files matching this pattern will be excluded from the synchronization (Default value = None).
- doc_sync – A synchronization strategy for document keys. If this argument is None, by default no keys will be synchronized upon conflict.
- dry_run – If True, do not actually perform the synchronization.
- kwargs – Extra keyword arguments will be forwarded to the sync_jobs() function, which actually executes the synchronization operation.
Raises: FileSyncConflict
– In case that a file synchronization results in a conflict.
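The role of the doc_sync callback can be illustrated with plain dictionaries. This is a conceptual sketch of the documented conflict rules, not signac’s implementation:

```python
def sync_documents(src, dst, doc_sync=None):
    """Copy keys from src into dst; delegate conflicting keys to doc_sync."""
    conflicts = {k for k in src if k in dst and src[k] != dst[k]}
    if conflicts and doc_sync is None:
        # Without a strategy, a conflict cannot be resolved.
        raise ValueError(f"Document sync conflict for keys: {sorted(conflicts)}")
    for key, value in src.items():
        if key not in conflicts:
            dst[key] = value
    if conflicts:
        doc_sync(src, dst)  # the strategy receives source and destination
    return dst

# A strategy that lets the source document win on conflict:
source_wins = lambda src, dst: dst.update({k: src[k] for k in src})
merged = sync_documents({'a': 1, 'b': 2}, {'b': 3, 'c': 4}, doc_sync=source_wins)
try:
    sync_documents({'x': 1}, {'x': 2})  # no strategy: conflict is an error
    conflict_raised = False
except ValueError:
    conflict_raised = True
```

Non-conflicting keys are always copied; only the conflicting keys are handed to the strategy.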
-
update_statepoint
(update, overwrite=False)¶ Update the state point of this job.
Warning
While appending to a job’s state point is generally safe, modifying existing parameters may lead to data inconsistency. Use the overwrite argument with caution!
Parameters: - update (dict) – A mapping used for the state point update.
- overwrite – Set to True to ignore whether this update overwrites parameters that are currently part of the job’s state point. Use with caution! (Default value = False)
Raises:
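The overwrite guard described above can be sketched in plain Python. This is a conceptual sketch; the real method also migrates the job to its new id:

```python
def update_statepoint(statepoint, update, overwrite=False):
    """Return a new state point with update applied, guarding existing keys."""
    if not overwrite:
        clashes = set(update) & set(statepoint)
        if clashes:
            # Refuse to silently change parameters that are already set.
            raise KeyError(f"Keys already in state point: {sorted(clashes)}")
    merged = dict(statepoint)
    merged.update(update)
    return merged

sp = update_statepoint({'kT': 1.0}, {'seed': 42})   # appending is safe
try:
    update_statepoint({'kT': 1.0}, {'kT': 2.0})     # clash without overwrite
    clashed = False
except KeyError:
    clashed = True
forced = update_statepoint({'kT': 1.0}, {'kT': 2.0}, overwrite=True)
```

Appending new keys always succeeds; replacing existing keys requires overwrite=True.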
-
workspace
()¶ Return the job’s unique workspace directory.
See signac job -w for the command line equivalent.
Returns: The path to the job’s workspace directory. Return type: str
-
ws
¶ Alias for
workspace()
.
-
The Collection¶
-
class
signac.
Collection
(docs=None, primary_key='_id', compresslevel=0, _trust=False)¶ A collection of documents.
The Collection class manages a collection of documents in memory or in a file on disk. A document is defined as a dictionary mapping of key-value pairs.
An instance of Collection may be used to manage and search documents. For example, given a collection with member data, where each document contains a name entry and an age entry, we can find the names of all members who are age 32 like this:
members = [
    {'name': 'John', 'age': 32},
    {'name': 'Alice', 'age': 28},
    {'name': 'Kevin', 'age': 32},
    # ...
]
member_collection = Collection(members)
for doc in member_collection.find({'age': 32}):
    print(doc['name'])
To iterate over all documents in the collection, use:
for doc in collection:
    print(doc)
By default a collection object will reside in memory. However, it is possible to manage a collection associated with a file on disk. To open a collection which is associated with a file on disk, use the Collection.open() class method:
with Collection.open('collection.txt') as collection:
    for doc in collection.find({'age': 32}):
        print(doc)
The collection file is by default opened in a+ mode, which means it can be read from and written to and will be created if it does not exist yet.
Parameters: - docs (iterable) – Initialize the collection with these documents.
- primary_key (str) – The name of the key which serves as the primary index of the collection. Selecting documents by primary key has time complexity of O(N) in the worst case and O(1) on average. All documents must have a primary key value. The default primary key is _id.
- compresslevel (int) – The level of compression to use. Any positive value implies compression and is used by the underlying gzip implementation. Default value is 0 (no compression).
Raises: ValueError
– When the first argument is a string.
-
clear
()¶ Remove all documents from the collection.
-
close
()¶ Close this collection instance.
In case that the collection is associated with a file-object, all changes are flushed to the file and the file is closed.
It is not possible to re-open the same collection instance after closing it.
-
delete_many
(filter)¶ Delete all documents that match the filter.
Parameters: filter (dict) – A document that should be deleted must match this filter.
-
delete_one
(filter)¶ Delete one document that matches the filter.
Parameters: filter (dict) – The document that should be deleted must match this filter.
-
dump
(file=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)¶ Dump the collection in JSON-encoding to file.
The file argument defaults to sys.stdout, which means the encoded blob will be printed to screen if no file argument is provided.
For example, to dump to a file on disk, one could write:
with open('my_collection.txt', 'w') as file:
    collection.dump(file)
Parameters: file – The file to write the encoded blob to (Default value = sys.stdout).
-
find
(filter=None, limit=0)¶ Find all documents matching filter, but not more than limit.
This function searches the collection for all documents that match the given filter and returns a result vector. For example:
for doc in collection.find(my_filter):
    print(doc)
Nested values should be searched using the . operator, for example:
docs = collection.find({'nested.value': 42})
will return documents with a nested structure: {'nested': {'value': 42}}.
The result of find() can be stored and iterated over multiple times. In addition, the result vector can be queried for its size:
docs = collection.find(my_filter)
print(len(docs))   # the number of matching documents
for doc in docs:   # iterate over the result vector
    pass
Arithmetic Operators
- $eq: equal
- $ne: not equal
- $gt: greater than
- $gte: greater than or equal to
- $lt: less than
- $lte: less than or equal to
project.find({"a": {"$lt": 5}})
Matches all docs with a less than 5.
Logical Operators
That includes $and and $or; both expect a list of expressions.
project.find({"$or": [{"a": 4}, {"b": {"$gt": 3}}]})
Matches all docs, where a is 4 or b is greater than 3.
Exists operator
Determines whether a specific key exists, or not, e.g.:
project.find({"a": {"$exists": True}})
Array operator
To determine whether specific elements are in ($in), or not in ($nin) an array, e.g.:
project.find({"a": {"$in": [0, 1, 2]}})
Matches all docs, where a is either 0, 1, or 2. Usage of $nin is equivalent.
Regular expression operator
Allows the “on-the-fly” evaluation of regular expressions, e.g.:
project.find({"protocol": {"$regex": "foo"}})
Will match all docs with a protocol that contains the term ‘foo’.
$type operator
Matches when a value is of specific type, e.g.:
project.find({"protocol": {"$type": str}})
Finds all docs, where the value of protocol is of type str. Other types that can be checked are: int, float, bool, list, and null.
$where operator
Matches an arbitrary python expression, e.g.:
project.find({"foo": {"$where": "lambda x: x.startswith('bar')"}})
Matches all docs, where the value for foo starts with the word ‘bar’.
Parameters: Returns: A result object that iterates over all matching documents.
Return type: _CollectionSearchResults
Raises: ValueError
– In case that the filter argument is invalid.
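The operator semantics listed above can be illustrated with a minimal pure-Python matcher. This is a sketch of the documented behavior, not signac’s implementation, and it covers only top-level keys and a subset of operators:

```python
import re

def _match_value(doc, key, expr):
    """Match one key of a document against a plain value or operator dict."""
    present = key in doc
    value = doc.get(key)
    if isinstance(expr, dict):
        ops = {
            '$eq': lambda a: value == a,
            '$ne': lambda a: value != a,
            '$gt': lambda a: present and value > a,
            '$gte': lambda a: present and value >= a,
            '$lt': lambda a: present and value < a,
            '$lte': lambda a: present and value <= a,
            '$in': lambda a: value in a,
            '$nin': lambda a: value not in a,
            '$regex': lambda a: present and re.search(a, value) is not None,
            '$exists': lambda a: present == a,
        }
        return all(ops[op](arg) for op, arg in expr.items())
    return value == expr  # a plain value means exact match

def find(docs, filter):
    """Return all documents matching the filter (supports $and / $or)."""
    def matches(doc, f):
        if '$or' in f:
            return any(matches(doc, sub) for sub in f['$or'])
        if '$and' in f:
            return all(matches(doc, sub) for sub in f['$and'])
        return all(_match_value(doc, k, v) for k, v in f.items())
    return [doc for doc in docs if matches(doc, filter)]

docs = [{'a': 4}, {'a': 1, 'b': 5}, {'b': 2}]
lt = find(docs, {'a': {'$lt': 5}})
either = find(docs, {'$or': [{'a': 4}, {'b': {'$gt': 3}}]})
missing = find(docs, {'a': {'$exists': False}})
```

Documents missing the queried key never match comparison operators, which is why the sketch checks key presence first.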
-
find_one
(filter=None)¶ Return one document that matches the filter or None.
doc = collection.find_one(my_filter)
if doc is None:
    print("No result found for filter", my_filter)
else:
    print("Doc matching filter:", my_filter, doc)
Parameters: filter (dict) – The returned document must match the given filter (Default value = None). Returns: A matching document or None. Return type: dict Raises: ValueError
– In case that the filter argument is invalid.
-
flush
()¶ Write all changes to the associated file.
If the collection instance is associated with a file-object, calling the flush() method will write all changes to this file.
This method is also called when the collection is explicitly or implicitly closed.
-
ids
¶ Get an iterator over the primary keys in the collection.
Returns: Iterator over the primary keys in the collection. Return type: iterable
-
index
(key, build=False)¶ Get (and optionally build) the index for a given key.
An index allows accessing documents by a specific key with minimal time complexity, e.g.:
age_index = member_collection.index('age')
for _id in age_index[32]:
    print(member_collection[_id]['name'])
This means we can access documents by the ‘age’ key in O(1) time on average, in addition to the primary key. Using the find() method will automatically build all required indexes for the particular search.
Once an index has been built, it will be internally managed by the class and updated with subsequent changes. An index returned by this method is always current with the latest state of the collection.
Parameters: Returns: Index for the given key.
Return type: Raises: KeyError
– In case that build is False and the index has not been built yet, or no index is present for the key.
-
insert_one
(doc)¶ Insert one document into the collection.
If the document does not have a value for the collection’s primary key yet, it will be assigned one.
_id = collection.insert_one(doc)
assert _id in collection
Note
If the document has no primary key yet, it will be updated in place and must therefore be mutable!
Parameters: doc (dict) – The document to be inserted. Returns: The _id of the inserted document. Return type: str
-
main
()¶ Start a command line interface for this Collection.
Use this function to interact with this instance of Collection on the command line. For example, executing the following script:
# find.py
with Collection.open('my_collection.txt') as c:
    c.main()
will enable us to search for documents on the command line like this:
$ python find.py '{"age": 32}'
{"name": "John", "age": 32}
{"name": "Kevin", "age": 32}
Raises: ValueError
– When both --id and --indent are selected.
-
classmethod
open
(filename, mode=None, compresslevel=None)¶ Open a collection associated with a file on disk.
Using this factory method will return a collection that is associated with a collection file on disk. For example:
with Collection.open('collection.txt') as collection:
    for doc in collection:
        print(doc)
will read all documents from the collection.txt file or create the file if it does not exist yet.
Modifications to the file will be written to the file when the flush() method is called or when the collection is closed, either explicitly by calling the Collection.close() method or implicitly by leaving the with-clause:
with Collection.open('collection.txt') as collection:
    collection.update(my_docs)
# All changes to the collection have been written to collection.txt.
The open-modes work as expected, so for example, to open a collection file in read-only mode, use Collection.open('collection.txt', 'r').
Opening a gzip (*.gz) file also works as expected. Because gzip does not support a combined read and write mode, mode=*+ is not available. Be sure to open the file in read, write, or append mode as required. Due to the manner in which gzip works, opening a file in mode=wt will effectively erase the current file, so take care using mode=wt.
Parameters: - filename (str) – Name of file to read the documents from or create the file if it does not exist.
- mode (str) – Open the file with mode (Default value = None).
- compresslevel (int) – The level of compression to use. Any positive value implies compression and is used by the underlying gzip implementation. (Default value = None)
Returns: An instance of
Collection
.Return type: Raises: RuntimeError
– When the file open-mode is not None for an in-memory collection, or when a compressed collection is not opened in binary mode.
-
primary_key
¶ Get the name of the collection’s primary key (default=’_id’).
-
classmethod
read_json
(file=None)¶ Construct an instance of Collection from a JSON file.
Parameters: file – The JSON file to read, provided as either a filename or a file-like object (Default value = None). Returns: A Collection containing the documents from the JSON file. Return type: Collection
-
replace_one
(filter, replacement, upsert=False)¶ Replace one document that matches the given filter.
The first document matching the filter will be replaced by the given replacement document. If the upsert argument is True, the replacement will be inserted if no document matches the filter.
Parameters: Returns: The id of the replaced (or upserted) document.
Return type: Raises: ValueError
– In case that the filter argument is invalid.
-
to_json
(file=None)¶ Dump the collection as a JSON file.
This function returns the JSON-string directly if the file argument is None.
Parameters: file – The filename or a file-like object to write the JSON string to (Default value = None). Returns: JSON-string when file argument is not provided. Return type: JSON
-
update
(docs)¶ Update the collection with these documents.
Any existing documents with the same primary key will be replaced.
Parameters: docs (iterable) – A sequence of documents to be upserted into the collection.
The JSONDict¶
This class implements the interface for the job’s statepoint
and document
attributes, but can also be used stand-alone:
-
class
signac.
JSONDict
(filename=None, write_concern=False, parent=None)¶ A dict-like mapping interface to a persistent JSON file.
The JSONDict is a MutableMapping and therefore behaves similarly to a dict, but all data is stored persistently in the associated JSON file on disk.
doc = JSONDict('data.json', write_concern=True)
doc['foo'] = "bar"
assert doc.foo == doc['foo'] == "bar"
assert 'foo' in doc
del doc['foo']
This class allows access to values through key indexing or attributes named by keys, including nested keys:
>>> doc['foo'] = dict(bar=True)
>>> doc
{'foo': {'bar': True}}
>>> doc.foo.bar = False
>>> doc
{'foo': {'bar': False}}
Warning
While the JSONDict object behaves like a dictionary, there are important distinctions to remember. In particular, because operations are reflected as changes to an underlying file, copying (even deep copying) a JSONDict instance may exhibit unexpected behavior. If a true copy is required, you should use the _as_dict() method to get a dictionary representation, and if necessary construct a new JSONDict instance: new_dict = JSONDict(old_dict._as_dict()).
Parameters: - filename – The filename of the associated JSON file on disk.
- write_concern – Ensure file consistency by writing changes back to a temporary file first, before replacing the original file. Default is False.
- parent – A parent instance of JSONDict or None.
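The persistence behavior described above, where every mutation is written back to the associated file, can be sketched with a minimal stand-in (a conceptual sketch, not signac’s JSONDict; write_concern and attribute access are omitted):

```python
import json
import os
import tempfile

class PersistentDict(dict):
    """A dict that mirrors every change to a JSON file on disk."""
    def __init__(self, filename):
        self._filename = filename
        if os.path.isfile(filename):
            with open(filename) as fh:
                super().__init__(json.load(fh))

    def _flush(self):
        # Every mutation is immediately written back to the file.
        with open(self._filename, 'w') as fh:
            json.dump(dict(self), fh)

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._flush()

    def __delitem__(self, key):
        super().__delitem__(key)
        self._flush()

with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, 'data.json')
    doc = PersistentDict(fn)
    doc['foo'] = 'bar'                    # immediately persisted
    reloaded = dict(PersistentDict(fn))   # a fresh instance sees the change
```

Because writes go straight to disk, a second instance pointed at the same file observes the mutation, which is exactly why naive copies of such an object behave unexpectedly.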
-
buffered
()¶ Context manager for buffering read and write operations.
This context manager activates the “buffered” mode, which means that all read operations are cached, and all write operations are deferred until the buffered mode is deactivated.
-
clear
() → None. Remove all items from D.¶
-
get
(k[, d]) → D[k] if k in D, else d. d defaults to None.¶
-
items
() → a set-like object providing a view on D's items¶
-
keys
() → a set-like object providing a view on D's keys¶
-
pop
(k[, d]) → v, remove specified key and return the corresponding value.¶ If key is not found, d is returned if given, otherwise KeyError is raised.
-
popitem
() → (k, v), remove and return some (key, value) pair¶ as a 2-tuple; but raise KeyError if D is empty.
-
reset
(data)¶ Replace the document contents with data.
-
setdefault
(k[, d]) → D.get(k,d), also set D[k]=d if k not in D¶
-
update
([E, ]**F) → None. Update D from mapping/iterable E and F.¶ If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k, v in F.items(): D[k] = v
-
values
() → an object providing a view on D's values¶
The H5Store¶
This class implements the interface to the job’s data
attribute, but can also be used stand-alone:
-
class
signac.
H5Store
(filename, **kwargs)¶ An HDF5-backed container for storing array-like and dictionary-like data.
The H5Store is a MutableMapping and therefore behaves similarly to a dict, but all data is stored persistently in the associated HDF5 file on disk.
Supported types include:
- built-in types (int, float, str, bool, NoneType, array)
- numpy arrays
- pandas data frames (requires pandas and pytables)
- mappings with values that are supported types
Values can be accessed as attributes (
h5s.foo
) or via key index (h5s['foo']
).Examples
>>> from signac import H5Store
>>> with H5Store('file.h5') as h5s:
...     h5s['foo'] = 'bar'
...     assert 'foo' in h5s
...     assert h5s.foo == 'bar'
...     assert h5s['foo'] == 'bar'
>>>
The H5Store can be used as a context manager to ensure that the underlying file is opened. However, most built-in types (excluding arrays) can be read and stored without the need to explicitly open the file. To access arrays (reading or writing), the file must always be opened!
To open a file in read-only mode, use the open() method with mode='r':
>>> with H5Store('file.h5').open(mode='r') as h5s:
...     pass
>>>
Parameters: - filename (str) – The filename of the underlying HDF5 file.
- **kwargs – Additional keyword arguments to be forwarded to the
h5py.File
constructor. See the documentation for the h5py.File constructor for more information.
-
clear
()¶ Remove all data from this store.
Danger
All data will be removed, this action cannot be reversed!
-
close
()¶ Close the underlying HDF5 file.
-
file
¶ Access the underlying instance of h5py.File.
This property exposes the underlying h5py.File object, enabling the use of functions such as create_dataset() or require_dataset().
Note
The store must be open to access this property!
Returns: The h5py
file-object that this store is operating on.Return type: h5py.File
Raises: H5StoreClosedError – When the store is closed at the time of accessing this property.
-
filename
¶ Return the H5Store filename.
-
flush
()¶ Flush the underlying HDF5 file.
-
get
(k[, d]) → D[k] if k in D, else d. d defaults to None.¶
-
items
() → a set-like object providing a view on D's items¶
-
keys
() → a set-like object providing a view on D's keys¶
-
mode
¶ Return the default opening mode of this H5Store.
-
open
(mode=None)¶ Open the underlying HDF5 file.
Parameters: mode – The file open mode to use. Defaults to ‘a’ (append). Returns: This H5Store instance.
-
pop
(k[, d]) → v, remove specified key and return the corresponding value.¶ If key is not found, d is returned if given, otherwise KeyError is raised.
-
popitem
() → (k, v), remove and return some (key, value) pair¶ as a 2-tuple; but raise KeyError if D is empty.
-
setdefault
(key, value)¶ Set a value for a key if that key is not already set.
-
update
([E, ]**F) → None. Update D from mapping/iterable E and F.¶ If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k, v in F.items(): D[k] = v
-
values
() → an object providing a view on D's values¶
The H5StoreManager¶
This class implements the interface to the job’s stores
attribute, but can also be used stand-alone:
-
class
signac.
H5StoreManager
(prefix)¶ Bases:
signac.core.dict_manager.DictManager
Helper class to manage multiple instances of
H5Store
within a directory.Example (assuming that the ‘stores/’ directory exists):
>>> stores = H5StoreManager('stores/')
>>> stores.data
<H5Store(filename=stores/data.h5)>
>>> stores.data.foo = True
>>> dict(stores.data)
{'foo': True}
Parameters: prefix – The directory prefix shared by all stores managed by this class. -
keys
()¶ Return an iterable of keys.
-
prefix
¶ Return the prefix.
-
Top-level functions¶
The signac framework aids in the management of large and heterogeneous data spaces.
It provides a simple and robust data model to create a well-defined, indexable storage layout for data and metadata. This makes it easier to operate on large data spaces, streamlines post-processing and analysis, and makes data collectively accessible.
-
signac.
TemporaryProject
(name=None, cls=None, **kwargs)¶ Context manager for the generation of a temporary project.
This is a factory function that creates a Project within a temporary directory and must be used as context manager, for example like this:
with TemporaryProject() as tmp_project:
    tmp_project.import_from('/data')
Parameters: - name (str) – An optional name for the temporary project. Defaults to a unique random string.
- cls – The class of the temporary project.
Defaults to
Project
. - **kwargs – Optional keyword arguments that are forwarded to the TemporaryDirectory class constructor, which is used to create a temporary root directory.
Yields:
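The factory-as-context-manager pattern described above can be sketched with the standard library. This is a conceptual sketch; the real function yields a Project instance rather than a path:

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temporary_workdir(name=None):
    """Yield a disposable directory that is removed when the context exits."""
    with tempfile.TemporaryDirectory() as tmp:
        # An optional name mirrors TemporaryProject's name argument.
        root = os.path.join(tmp, name) if name else tmp
        os.makedirs(root, exist_ok=True)
        yield root  # TemporaryProject yields a Project rooted in such a directory

with temporary_workdir('tmp_project') as root:
    existed_inside = os.path.isdir(root)
root_after = root
```

Inside the with-block the directory exists; on exit, the whole tree is removed, which is what makes the pattern suitable for throwaway test projects.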
-
signac.
get_project
(root=None, search=True, **kwargs)¶ Find a project configuration and return the associated project.
Parameters: - root (str) – The starting point to search for a project, defaults to the current working directory.
- search (bool) – If True, search for project configurations inside and above the specified root directory, otherwise only return projects with a root directory identical to the specified root argument (Default value = True).
- **kwargs – Optional keyword arguments that are forwarded to
get_project()
.
Returns: An instance of
Project
.Return type: Raises: LookupError
– If no project configuration can be found.
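The search behavior described above amounts to walking from root toward the filesystem root and stopping at the first directory that contains a project configuration file. A conceptual sketch (the configuration file name signac.rc is an assumption for illustration):

```python
import os
import tempfile

def find_project_root(root, config_name='signac.rc', search=True):
    """Return the closest directory at or above root containing config_name."""
    path = os.path.abspath(root)
    while True:
        if os.path.isfile(os.path.join(path, config_name)):
            return path
        parent = os.path.dirname(path)
        if not search or parent == path:  # search disabled or filesystem root
            raise LookupError(f"No project configuration found above {root!r}.")
        path = parent

# Demonstration: a config at the top, queried from a nested directory.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'signac.rc'), 'w').close()
    nested = os.path.join(tmp, 'a', 'b')
    os.makedirs(nested)
    found = find_project_root(nested)
    expected = os.path.abspath(tmp)
```

With search=False, only the root directory itself is checked, matching the documented behavior of the search argument.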
-
signac.
init_project
(name, root=None, workspace=None, make_dir=True)¶ Initialize a project with the given name.
It is safe to call this function multiple times with the same arguments. However, a RuntimeError is raised if an existing project configuration would conflict with the provided initialization parameters.
Parameters: - name (str) – The name of the project to initialize.
- root (str) – The root directory for the project. Defaults to the current working directory.
- workspace (str) – The workspace directory for the project.
Defaults to a subdirectory
workspace
in the project root. - make_dir (bool) – Create the project root directory, if it does not exist yet (Default value = True).
Returns: The initialized project instance.
Return type: Raises: RuntimeError
– If the project root path already contains a conflicting project configuration.
-
signac.
get_job
(root=None)¶ Find a Job in or above the current working directory (or provided path).
Parameters: root (str) – The job root directory. If no root directory is given, the current working directory is assumed to be within the current job workspace directory (Default value = None). Returns: Job handle. Return type: Job
Raises: LookupError
– If this job cannot be found.Examples
When the current directory is a job workspace directory:
>>> signac.get_job()
signac.contrib.job.Job(project=..., statepoint={...})
-
signac.
diff_jobs
(*jobs)¶ Find differences among a list of jobs’ state points.
The resulting diff is a dictionary where the keys are job ids and the values are each job’s state point minus the intersection of all provided jobs’ state points. The comparison is performed over the combined set of keys and values.
See signac diff for the command line equivalent.
Parameters: *jobs (sequence[ Job
]) – Sequence of jobs to diff.Returns: A dictionary where the keys are job ids and values are the unique parts of that job’s state point. Return type: dict Examples
>>> import signac
>>> project = signac.init_project('project_name')
>>> job1 = project.open_job({'constant': 42, 'diff1': 0, 'diff2': 1}).init()
>>> job2 = project.open_job({'constant': 42, 'diff1': 1, 'diff2': 1}).init()
>>> job3 = project.open_job({'constant': 42, 'diff1': 2, 'diff2': 2}).init()
>>> print(job1)
c4af2b26f1fd256d70799ad3ce3bdad0
>>> print(job2)
b96b21fada698f8934d58359c72755c0
>>> print(job3)
e4289419d2b0e57e4852d44a09f167c0
>>> signac.diff_jobs(job1, job2, job3)
{'c4af2b26f1fd256d70799ad3ce3bdad0': {'diff2': 1, 'diff1': 0}, 'b96b21fada698f8934d58359c72755c0': {'diff2': 1, 'diff1': 1}, 'e4289419d2b0e57e4852d44a09f167c0': {'diff2': 2, 'diff1': 2}}
>>> signac.diff_jobs(*project)
{'c4af2b26f1fd256d70799ad3ce3bdad0': {'diff2': 1, 'diff1': 0}, 'b96b21fada698f8934d58359c72755c0': {'diff2': 1, 'diff1': 1}, 'e4289419d2b0e57e4852d44a09f167c0': {'diff2': 2, 'diff1': 2}}
-
signac.
get_database
(name, hostname=None, config=None)¶ Get a database handle.
The database handle is an instance of Database, which provides access to the document collections within one database.
db = signac.db.get_database('MyDatabase')
docs = db.my_collection.find()
Please note that a collection that did not exist at the point of access will be created automatically.
Parameters: - name (str) – The name of the database to get.
- hostname (str) – The name of the configured host. Defaults to the first configured host, or the host specified by default_host.
- config (
common.config.Config
) – The config object to retrieve the host configuration from. Defaults to the global configuration.
Returns: The database handle.
Return type: Deprecated since version 1.3: This will be removed in 2.0. The database package is deprecated.
-
signac.
fetch
(doc_or_id, mode='r', mirrors=None, num_tries=3, timeout=60, ignore_local=False)¶ Fetch the file associated with this document or file id.
This function retrieves a file associated with the provided index document or file id and behaves like the built-in open() function, e.g.:
for doc in index:
    with signac.fetch(doc) as file:
        do_something_with(file)
Parameters: - doc_or_id – A file_id or a document with a file_id value.
- mode – Mode to use for opening files.
- mirrors – An optional set of mirrors to fetch the file from.
- num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.
- timeout (int) – The time in seconds to wait before an automatic retry attempt.
Returns: The file associated with the document or file id.
Return type: A file-like object
Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.
-
signac.
export_one
(doc, index, mirrors=None, num_tries=3, timeout=60)¶ Export one document to index and an optionally associated file to mirrors.
Parameters: - doc – A document with a file_id entry.
- index – The index collection to export to.
- mirrors – An optional set of mirrors to export files to.
- num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.
- timeout (int) – The time in seconds to wait before an automatic retry attempt.
Returns: The id and file id after successful export.
Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.
-
signac.
export
(docs, index, mirrors=None, update=False, num_tries=3, timeout=60, **kwargs)¶ Export docs to index and optionally associated files to mirrors.
The behavior of this function is equivalent to:
for doc in docs:
    export_one(doc, index, mirrors, num_tries)
If the update argument is set to True, the export algorithm will automatically identify stale index documents, that is, documents that refer to files or state points that have been removed and are no longer part of the data space. Any document that shares the root, but not the _id field, with any of the updated documents is considered stale and removed. Using update in combination with an empty docs sequence will raise ExportError, since it is not possible to identify stale documents in that case.
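The stale-document rule described above can be sketched in plain Python. This is a conceptual sketch, not the actual signac implementation; the find_stale helper is an illustrative name, and the documents are assumed to carry the root and _id fields mentioned above:

```python
# Illustrative sketch of stale-document identification: a document is
# stale if it shares a root with an updated document but its _id no
# longer appears among the updated documents.
def find_stale(existing_docs, updated_docs):
    updated_ids = {doc['_id'] for doc in updated_docs}
    updated_roots = {doc['root'] for doc in updated_docs}
    return [
        doc for doc in existing_docs
        if doc['root'] in updated_roots and doc['_id'] not in updated_ids
    ]

existing = [
    {'_id': 'a', 'root': '/data'},
    {'_id': 'b', 'root': '/data'},   # file was removed from the data space
    {'_id': 'c', 'root': '/other'},  # different root, left untouched
]
updated = [{'_id': 'a', 'root': '/data'}]
print([d['_id'] for d in find_stale(existing, updated)])  # ['b']
```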
Note
This function will automatically delegate to specialized implementations for special index types. For example, if the index argument is a MongoDB document collection, the index documents will be exported via
export_pymongo()
.Parameters: - docs – The index documents to export.
- index – The collection to export the index to.
- mirrors – An optional set of mirrors to export files to.
- update (bool) – If True, remove stale index documents, that is, documents that refer to files or state points that no longer exist.
- num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.
- timeout (int) – The time in seconds to wait before an automatic retry attempt.
- kwargs – Optional keyword arguments to pass to delegate implementations.
Raises: ExportError – When using the update argument in combination with an empty docs sequence.
Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.
-
signac.
export_to_mirror
(doc, mirror, num_tries=3, timeout=60)¶ Export a file associated with doc to mirror.
Parameters: Returns: The file id after successful export.
Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.
-
signac.
export_pymongo
(docs, index, mirrors=None, update=False, num_tries=3, timeout=60, chunksize=100)¶ Optimized
export()
function for pymongo index collections. The behavior of this function is roughly equivalent to:

for doc in docs:
    export_one(doc, index, mirrors, num_tries)
Note
All index documents must be JSON-serializable to be able to be exported to a MongoDB collection.
Parameters: - docs – The index documents to export.
- index (
pymongo.collection.Collection
) – The database collection to export the index to. - num_tries (int) – The number of automatic retry attempts in case of mirror connection errors.
- timeout (int) – The time in seconds to wait before an automatic retry attempt.
- chunksize (int) – The buffer size for export operations.
Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.
-
signac.
index_files
(root='.', formats=None, depth=0)¶ Generate a file index.
This generator function yields file index documents, where each index document corresponds to one file.
To index all files in the current working directory, simply execute:
for doc in signac.index_files():
    print(doc)
A file associated with a file index document can be fetched via the
fetch()
function:

for doc in signac.index_files():
    with signac.fetch(doc) as file:
        print(file.read())
This is especially useful if the file index is part of a collection (
Collection
) which can be searched for specific entries. To limit the file index to files with a specific filename format, provide a regular expression as the formats argument. To index all files with the file ending .txt, execute:
for doc in signac.index_files(formats=r'.*\.txt'):
    print(doc)
We can assign specific format labels by providing a dictionary as the
formats
argument, where the key is a filename pattern and the value is an arbitrary format string, e.g.:

for doc in signac.index_files(formats={
        r'.*\.txt': 'TextFile',
        r'.*\.zip': 'ZipFile'}):
    print(doc)
Parameters: Yields: The file index documents as dicts.
Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.
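The pattern-to-format mapping used by index_files can be sketched without signac, using only the re module. The classify helper and the example filenames below are illustrative, not part of the signac API:

```python
import re

# Illustrative helper mirroring the formats-dict semantics described
# above: the first pattern that matches the filename determines the
# format label; unmatched files yield None.
def classify(filename, formats):
    for pattern, label in formats.items():
        if re.match(pattern, filename):
            return label
    return None

formats = {r'.*\.txt': 'TextFile', r'.*\.zip': 'ZipFile'}
print(classify('notes.txt', formats))  # TextFile
print(classify('data.zip', formats))   # ZipFile
print(classify('image.png', formats))  # None
```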
-
signac.
index
(root='.', tags=None, depth=0, **kwargs)¶ Generate a main index.
A main index is compiled from other indexes by searching for modules named
signac_access.py
and compiling all indexes which are yielded from a functionget_indexes(root)
defined within that module as well as the indexes generated by crawlers yielded from a functionget_crawlers(root)
defined within that module.This is a minimal example for a
signac_access.py
file:import signac def get_indexes(root): yield signac.index_files(root, r'.*\.txt')
Internally, this function constructs an instance of
MainCrawler
and all extra keyword arguments will be forwarded to the constructor of that main crawler.Parameters: - root (str) – Look for access modules under this directory path.
- tags – If tags are provided, do not execute subcrawlers that don’t match the same tags.
- depth (int) – Limit the search to the specified directory depth.
- kwargs – These keyword arguments are forwarded to the internal MainCrawler instance.
Yields: The main index documents as instances of dict.
Deprecated since version 1.3: This will be removed in 2.0. The indexing module is deprecated.
-
signac.
buffered
(buffer_size=33554432, force_write=False)¶ Enter a global buffer mode for all JSONDict instances.
All future write operations are written to the buffer, read operations are performed from the buffer whenever possible.
All write operations are deferred until the flush_all() function is called, the buffer overflows, or upon exiting the buffer mode.
This context may be entered multiple times; however, the buffer size can only be set once. Any subsequent specifications of the buffer size are ignored.
Parameters: buffer_size (int) – Specify the maximum size of the read/write buffer. Defaults to DEFAULT_BUFFER_SIZE. A negative number indicates to not restrict the buffer size.
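The deferred-write behavior described above can be illustrated with a minimal sketch. The WriteBuffer class below only mimics the buffering semantics; it is not signac's actual JSONDict implementation:

```python
# Conceptual sketch of buffered mode: writes are deferred to an
# in-memory buffer, reads prefer the buffer, and flush() persists
# all pending writes to the backend at once.
class WriteBuffer:
    def __init__(self):
        self._pending = {}  # deferred writes, key -> value
        self._store = {}    # stands in for the on-disk backend

    def write(self, key, value):
        self._pending[key] = value          # deferred, not yet persisted

    def read(self, key):
        if key in self._pending:            # serve buffered data first
            return self._pending[key]
        return self._store[key]

    def flush(self):
        self._store.update(self._pending)   # persist all deferred writes
        self._pending.clear()

buf = WriteBuffer()
buf.write('a', 1)
print(buf.read('a'))      # 1 (served from the buffer)
print('a' in buf._store)  # False (not yet flushed)
buf.flush()
print(buf._store['a'])    # 1 (persisted)
```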
-
signac.
is_buffered
()¶ Return true if in buffered read/write mode.
-
signac.
flush
()¶ Execute all deferred JSONDict write operations.
-
signac.
get_buffer_size
()¶ Return the current maximum size of the read/write buffer.
-
signac.
get_buffer_load
()¶ Return the current actual size of the read/write buffer.
Submodules¶
signac.cite module¶
Functions to support citing this software.
-
signac.cite.
bibtex
(file=None)¶ Generate bibtex entries for signac.
The bibtex entries will be printed to screen unless a filename or a file-like object are provided, in which case they will be written to the corresponding file.
Note
A full reference should also include the version of this software. Please refer to the documentation on how to cite a specific version.
Parameters: file – A str or file-like object. Defaults to sys.stdout.
-
signac.cite.
reference
(file=None)¶ Generate formatted reference entries for signac.
The references will be printed to screen unless a filename or a file-like object are provided, in which case they will be written to the corresponding file.
Note
A full reference should also include the version of this software. Please refer to the documentation on how to cite a specific version.
Parameters: file – A str or file-like object. Defaults to sys.stdout.
signac.sync module¶
Synchronization of jobs and projects.
Jobs may be synchronized by copying all data from the source job to the destination job. This means all files are copied and the documents are synchronized. Conflicts, that is, cases in which both jobs contain conflicting data, may be resolved with a user-defined strategy.
The synchronization of projects is in essence the synchronization of all jobs in the destination project with those in the source project, together with the synchronization of the project document. If a specific job does not exist yet at the destination, it is simply cloned; otherwise, it is synchronized.
A sync strategy is a function (or functor) that takes the source job,
the destination job, and the name of the file generating the conflict
as arguments and returns the decision whether to overwrite the file as
Boolean. There are some default strategies defined within this module as
part of the FileSync
class. These are the default strategies:
- always – Always overwrite on conflict.
- never – Never overwrite on conflict.
- update – Overwrite when the modification time of the source file is newer.
- Ask – Ask the user interactively about each conflicting filename.
For example, to synchronize two projects resolving conflicts by modification time, use:
dest_project.sync(source_project, strategy=sync.FileSync.update)
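A strategy is simply a callable with the signature strategy(src, dst, fn) that returns a Boolean. As a minimal sketch (the never_overwrite_logs name is illustrative, not part of signac):

```python
# Illustrative custom file-sync strategy: resolve conflicts by
# overwriting everything except log files. The src and dst arguments
# would be the jobs involved; only the filename is inspected here.
def never_overwrite_logs(src, dst, fn):
    return not fn.endswith('.log')

print(never_overwrite_logs(None, None, 'output.log'))  # False -> keep
print(never_overwrite_logs(None, None, 'result.txt'))  # True  -> overwrite
```

Such a callable could then be passed in place of the built-in strategies, e.g. dest_project.sync(source_project, strategy=never_overwrite_logs).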
Unlike files, which are always either overwritten as a whole or not, documents
can be synchronized more fine-grained with a sync function. Such a function (or
functor) takes the source and the destination document as arguments and performs
the synchronization. The user is encouraged to implement their own sync functions,
but there are a few default functions implemented as part of the DocSync
class:
- NO_SYNC – Do not perform any synchronization.
- COPY – Apply the same strategy used to resolve file conflicts.
- update – Equivalent to dst.update(src).
- ByKey – Synchronize the source document key by key, more information below.
This is how we could synchronize two jobs, where the documents are synchronized with a simple update function:
dst_job.sync(src_job, doc_sync=sync.DocSync.update)
The DocSync.ByKey
functor attempts to synchronize the destination document
with the source document without overwriting any data. That means this function
behaves similarly to update()
for a non-intersecting set of keys,
but will also preserve nested mappings without overwriting values. Furthermore,
any key conflict, that is, a key that is present in both documents but with
differing data, will raise a DocumentSyncConflict
exception.
The user may explicitly decide to overwrite certain keys by providing a “key-strategy”,
which is a function that takes the conflicting key as argument, and returns the
decision whether to overwrite that specific key as Boolean. For example, to sync
two jobs, where conflicting keys should only be overwritten if they contain the
term ‘foo’, we could execute:
dst_job.sync(src_job, doc_sync=sync.DocSync.ByKey(lambda key: 'foo' in key))
This means that all documents are synchronized key by key, and only conflicting keys that
contain the word “foo” will be overwritten; any other conflict will raise a DocumentSyncConflict
exception. A key-strategy may also be
a regular expression, so the synchronization above could also be achieved with:
dst_job.sync(src_job, doc_sync=sync.DocSync.ByKey('foo'))
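The two forms above can be sketched as follows; make_key_strategy is an illustrative helper that mirrors, but is not, the actual ByKey implementation:

```python
import re

# Sketch of key-strategy semantics: a callable (or a regular expression
# treated as one) decides per conflicting key whether to overwrite.
def make_key_strategy(strategy):
    if isinstance(strategy, str):  # regex shorthand
        return lambda key: re.search(strategy, key) is not None
    return strategy                # already a callable

by_callable = make_key_strategy(lambda key: 'foo' in key)
by_regex = make_key_strategy('foo')

# Both forms make the same per-key decisions for these keys:
for key in ('foo_param', 'bar_param'):
    assert by_callable(key) == by_regex(key)

print(by_regex('foo_param'))  # True  -> overwrite this conflicting key
print(by_regex('bar_param'))  # False -> would raise on conflict
```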
-
class
signac.sync.
FileSync
¶ Bases:
object
Collection of file synchronization strategies.
-
class
Ask
¶ Bases:
object
Resolve sync conflicts by asking whether a file should be overwritten interactively.
-
static
always
(src, dst, fn)¶ Resolve sync conflicts by always overwriting.
-
classmethod
keys
()¶ Return keys.
-
static
never
(src, dst, fn)¶ Resolve sync conflicts by never overwriting.
-
static
update
(src, dst, fn)¶ Resolve sync conflicts based on newest modified timestamp.
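The timestamp comparison behind the update strategy can be sketched as follows, assuming plain file paths instead of the job objects a real strategy receives; the newer helper is an illustrative name:

```python
import os
import tempfile

# Sketch of an mtime-based strategy in the spirit of FileSync.update:
# overwrite only when the source file was modified more recently than
# the destination file.
def newer(src_path, dst_path):
    return os.path.getmtime(src_path) > os.path.getmtime(dst_path)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'src.txt')
    dst = os.path.join(tmp, 'dst.txt')
    open(src, 'w').close()
    open(dst, 'w').close()
    os.utime(dst, (1000, 1000))  # destination: older timestamp
    os.utime(src, (2000, 2000))  # source: newer timestamp
    print(newer(src, dst))  # True  -> overwrite destination
    print(newer(dst, src))  # False -> keep destination
```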
-
class
signac.sync.
DocSync
¶ Bases:
object
Collection of document synchronization functions.
-
COPY
= 'copy'¶ Copy (and potentially overwrite) documents like any other file.
-
NO_SYNC
= False¶ Do not synchronize documents.
-
static
update
(src, dst)¶ Perform a simple update.
-
-
signac.sync.
sync_jobs
(src, dst, strategy=None, exclude=None, doc_sync=None, recursive=False, follow_symlinks=True, preserve_permissions=False, preserve_times=False, preserve_owner=False, preserve_group=False, deep=False, dry_run=False)¶ Synchronize the dst job with the src job.
By default, this method will synchronize all files and document data of the dst job with the src job until a synchronization conflict occurs. There are two different kinds of synchronization conflicts:
- The two jobs have files with the same name, but different content.
- The two jobs have documents that share keys, but those keys are mapped to different values.
A file conflict can be resolved by providing a ‘FileSync’ strategy or by excluding files from the synchronization. An unresolvable conflict is indicated by raising a
FileSyncConflict
exception.A document synchronization conflict can be resolved by providing a doc_sync function that takes the source and the destination document as its first and second arguments.
Parameters: - src (
Job
) – The src job, data will be copied from this job’s workspace. - dst (
Job
) – The dst job, data will be copied to this job’s workspace. - strategy (callable) – A synchronization strategy for file conflicts. The strategy should be a
callable with signature
strategy(src, dst, filepath)
wheresrc
anddst
are the source and destination instances ofProject
andfilepath
is the filepath relative to the project root. If no strategy is provided, aerrors.SyncConflict
exception will be raised upon conflict. (Default value = None) - exclude (str) – A filename exclusion pattern. All files matching this pattern will be excluded from the synchronization process. (Default value = None)
- doc_sync (attribute or callable from
DocSync
) – A synchronization strategy for document keys. The default is to use a safe key-by-key strategy that will not overwrite any values on conflict, but instead raises aDocumentSyncConflict
exception. - recursive (bool) – Recursively synchronize sub-directories encountered within the job workspace directories. (Default value = False)
- follow_symlinks (bool) – Follow and copy the target of symbolic links. (Default value = True)
- preserve_permissions (bool) – Preserve file permissions (Default value = False)
- preserve_times (bool) – Preserve file modification times (Default value = False)
- preserve_owner (bool) – Preserve file owner (Default value = False)
- preserve_group (bool) – Preserve file group ownership (Default value = False)
- dry_run (bool) – If True, do not actually perform any synchronization operations. (Default value = False)
- deep (bool) – (Default value = False)
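As a minimal sketch of a custom doc_sync callable (merge_keeping_dst is an illustrative name, not a signac API), here is one that copies keys from the source document without ever overwriting destination values:

```python
# Illustrative doc_sync callable: copy only keys from src that are not
# already present in dst, never overwriting existing destination values.
def merge_keeping_dst(src, dst):
    for key, value in src.items():
        dst.setdefault(key, value)

src = {'a': 1, 'b': 2}
dst = {'b': 99}
merge_keeping_dst(src, dst)
print(dst)  # {'b': 99, 'a': 1}
```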
-
signac.sync.
sync_projects
(source, destination, strategy=None, exclude=None, doc_sync=None, selection=None, check_schema=True, recursive=False, follow_symlinks=True, preserve_permissions=False, preserve_times=False, preserve_owner=False, preserve_group=False, deep=False, dry_run=False, parallel=False, collect_stats=False)¶ Synchronize the destination project with the source project.
Try to clone all jobs from the source to the destination. If a destination job already exists, try to synchronize the job using the optionally specified strategy.
Parameters: - source (Project) – The project that serves as the source for the synchronization.
- destination (Project) – The project that is modified by the synchronization.
- strategy (callable) – A synchronization strategy for file conflicts. The strategy should be a
callable with signature
strategy(src, dst, filepath)
wheresrc
anddst
are the source and destination instances ofProject
andfilepath
is the filepath relative to the project root. If no strategy is provided, aerrors.SyncConflict
exception will be raised upon conflict. (Default value = None) - exclude (str) – A filename exclusion pattern. All files matching this pattern will be excluded from the synchronization process. (Default value = None)
- doc_sync (attribute or callable from
DocSync
) – A synchronization strategy for document keys. The default is to use a safe key-by-key strategy that will not overwrite any values on conflict, but instead raises aDocumentSyncConflict
exception. - selection (sequence of
Job
or job ids (str)) – Only synchronize the given selection of jobs. (Default value = None) - check_schema (bool) – If True, only synchronize if this and the other project have a matching
state point schema. See also:
detect_schema()
. (Default value = True) - recursive (bool) – Recursively synchronize sub-directories encountered within the job workspace directories. (Default value = False)
- follow_symlinks (bool) – Follow and copy the target of symbolic links. (Default value = True)
- preserve_permissions (bool) – Preserve file permissions (Default value = False)
- preserve_times (bool) – Preserve file modification times (Default value = False)
- preserve_owner (bool) – Preserve file owner (Default value = False)
- preserve_group (bool) – Preserve file group ownership (Default value = False)
- dry_run (bool) – If True, do not actually perform the synchronization operation, just log what would happen theoretically. Useful to test synchronization strategies without the risk of data loss. (Default value = False)
- deep (bool) – (Default value = False)
- parallel (bool) – (Default value = False)
- collect_stats (bool) – (Default value = False)
Returns: Returns stats if
collect_stats
isTrue
, elseNone
.Return type: NoneType or
FileTransferStats
Raises: DocumentSyncConflict
– If there are conflicting keys within the project or job documents that cannot be resolved with the given strategy or if there is no strategy provided.FileSyncConflict
– If there are differing files that cannot be resolved with the given strategy or if no strategy is provided.SchemaSyncConflict
– In case that the check_schema argument is True and the detected state point schema of this and the other project differ.
signac.warnings module¶
Module for signac deprecation warnings.
-
exception
signac.warnings.
SignacDeprecationWarning
¶ Bases:
UserWarning
Indicates the deprecation of a signac feature, API or behavior.
This class indicates a user-relevant deprecation and is therefore a UserWarning, not a DeprecationWarning, which is hidden by default.
signac.errors module¶
Errors raised by signac.
-
exception
signac.errors.
BufferException
¶ Bases:
signac.core.errors.Error
An exception occurred in buffered mode.
-
exception
signac.errors.
BufferedFileError
(files)¶ Bases:
signac.core.jsondict.BufferException
Raised when an error occurred while flushing one or more buffered files.
-
files
A dictionary of files that caused issues during the flush operation, mapped to a possible reason for the issue, or None if the reason cannot be determined.
-
-
exception
signac.errors.
ConfigError
¶ Bases:
signac.core.errors.Error
,RuntimeError
Error with parsing or reading a configuration file.
-
exception
signac.errors.
AuthenticationError
¶ Bases:
signac.core.errors.Error
,RuntimeError
Authentication error.
-
exception
signac.errors.
ExportError
¶ Bases:
signac.core.errors.Error
,RuntimeError
Error exporting documents to a mirror.
-
exception
signac.errors.
FetchError
¶ Bases:
FileNotFoundError
Error in fetching data.
-
exception
signac.errors.
DestinationExistsError
(destination)¶ Bases:
signac.core.errors.Error
,RuntimeError
The destination for a move or copy operation already exists.
Parameters: destination – The destination object causing the error.
-
exception
signac.errors.
JobsCorruptedError
(job_ids)¶ Bases:
signac.core.errors.Error
,RuntimeError
The state point manifest file of one or more jobs cannot be opened or is corrupted.
Parameters: job_ids – The job id(s) of the corrupted job(s).
-
exception
signac.errors.
IncompatibleSchemaVersion
¶ Bases:
signac.core.errors.Error
The project’s schema version is incompatible with this version of signac.
-
exception
signac.errors.
SyncConflict
¶ Bases:
signac.core.errors.Error
,RuntimeError
Raised when a synchronization operation fails.
-
exception
signac.errors.
FileSyncConflict
(filename)¶ Bases:
signac.errors.SyncConflict
Raised when a synchronization operation fails due to a file conflict.
-
filename
= None¶ The filename of the file that caused the conflict.
-
-
exception
signac.errors.
DocumentSyncConflict
(keys)¶ Bases:
signac.errors.SyncConflict
Raised when a synchronization operation fails due to a document conflict.
-
keys
= None¶ The keys that caused the conflict.
-
-
exception
signac.errors.
SchemaSyncConflict
(schema_src, schema_dst)¶ Bases:
signac.errors.SyncConflict
Raised when a synchronization operation fails due to schema differences.
-
exception
signac.errors.
InvalidKeyError
¶ Bases:
ValueError
Raised when a user uses a non-conforming key.