In the era of big data and interconnected systems, organizations face the challenge of effectively managing and analyzing complex relationships within their data. Graph databases have emerged as a powerful solution for representing and querying highly connected data, enabling businesses to gain valuable insights. With GUAC (Graph for Understanding Artifact Composition), we face a similar issue. With large amounts of data from SBOMs, SLSA attestations, and other ITE-6 based attestations, it is essential that we can effectively manage and create the edges that represent the relationships between these documents.
As we work to meet the goals of persistence in GUAC, we are running a series of analyses and comparisons among the many different graph database options. GUAC has a few critically important requirements for the backend: efficient data ingestion, performant execution of complex queries, a well-designed schema for storing the data, and finally, optimization of queries for the specific query language.
Evaluating the Requirements
- Efficient Data Ingestion: A graph database that excels at data ingestion is essential for managing large-scale datasets. The speed and efficiency of data ingestion directly impact the ability to update and integrate new information into the database seamlessly. This is important because software supply chain metadata will grow constantly. One example is the time it takes to ingest packages and create the dependency relationships between them. In GraphQL, this relationship would look like:
mutation IngestPackage($pkg: PkgInputSpec!) {
  ingestPackage(pkg: $pkg) {
    ...AllPkgTree
  }
}

mutation IsDependency(
  $pkg: PkgInputSpec!
  $depPkg: PkgInputSpec!
  $dependency: IsDependencyInputSpec!
) {
  ingestDependency(pkg: $pkg, depPkg: $depPkg, dependency: $dependency) {
    ...allIsDependencyTree
  }
}
- First, we have to ingest the packages (both the package and its dependency) and only then ingest the dependency node that connects the two packages together. We have to reduce the time it takes to complete this series of steps as much as possible.
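The two mutations above amount to an "ingest the endpoints, then ingest the edge" sequence. As an illustration only (hypothetical helper names and in-memory dictionaries standing in for a real persistent backend), the flow might be sketched in Python as:

```python
# In-memory sketch of GUAC's two-step ingestion flow.
# These helper names are illustrative, not GUAC's actual API.

packages = {}      # purl -> package node
dependencies = []  # "isDependency" nodes linking two packages

def ingest_package(purl):
    """Step 1: idempotently ingest a package node (mirrors IngestPackage)."""
    return packages.setdefault(purl, {"purl": purl})

def ingest_dependency(pkg_purl, dep_purl, version_range):
    """Step 2: create the dependency node that connects the two packages."""
    # Both endpoints must already exist before the edge node can be created.
    pkg = ingest_package(pkg_purl)
    dep = ingest_package(dep_purl)
    node = {"package": pkg, "dependentPackage": dep, "versionRange": version_range}
    dependencies.append(node)
    return node

ingest_package("pkg:golang/github.com/guacsec/guac")
ingest_package("pkg:golang/github.com/spf13/cobra")
ingest_dependency("pkg:golang/github.com/guacsec/guac",
                  "pkg:golang/github.com/spf13/cobra", ">=1.0.0")
```

Each step is a separate round trip to the backend, which is why reducing the cost of this sequence matters at SBOM scale.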
- Quick and Complex Query Execution: Graph databases provide powerful querying capabilities for traversing complex relationships and uncovering valuable insights. The efficiency of query execution is crucial, especially when dealing with large datasets. For GUAC, answering specific queries quickly enables proactive decision-making. While complex queries might still take time that a fast-paced industry leader cannot afford, we can work around this by pre-computing results from these queries and creating “proceed” or “not proceed” type attestations that can be updated on a set schedule. For example, for GUAC we would want to quickly determine all the vulnerabilities (even transitive ones) of a particular package or artifact:
./bin/guacone query vuln "pkg:guac/spdx/ghcr.io/guacsec/vul-image-latest"
+-------------+-----------+---------------------------------------+
| NODE TYPE | NODE ID # | ADDITIONAL INFORMATION |
+-------------+-----------+---------------------------------------+
| certifyVuln | 131588 | vulnerability ID: ghsa-vwqq-5vrc-xw9h |
| certifyVuln | 131585 | vulnerability ID: ghsa-8489-44mv-ggj8 |
| certifyVuln | 131594 | vulnerability ID: ghsa-jfh8-c2jp-5v3q |
| certifyVuln | 131598 | vulnerability ID: ghsa-7rjr-3q55-vv33 |
| certifyVuln | 131600 | vulnerability ID: ghsa-p6xc-xr62-6r2g |
| certifyVuln | 131590 | vulnerability ID: ghsa-fxph-q3j8-mv87 |
| certifyVuln | 131574 | vulnerability ID: ghsa-599f-7c49-w659 |
+-------------+-----------+---------------------------------------+
- The query above has to traverse the relationships between packages and vulnerability information to produce the output shown.
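The kind of traversal behind that query can be illustrated with a small breadth-first search over a toy dependency graph (the package names and vulnerability mappings below are made up for illustration, not real GUAC data):

```python
from collections import deque

# Toy dependency graph and per-package vulnerability attestations (illustrative).
deps = {
    "vul-image-latest": ["log4j-core", "text"],
    "log4j-core": [],
    "text": [],
}
vulns = {
    "log4j-core": ["ghsa-7rjr-3q55-vv33", "ghsa-p6xc-xr62-6r2g"],
    "text": ["ghsa-599f-7c49-w659"],
}

def transitive_vulns(pkg):
    """Walk the dependency graph breadth-first, collecting every
    vulnerability attached to the package or any transitive dependency."""
    seen, found = {pkg}, set(vulns.get(pkg, []))
    queue = deque(deps.get(pkg, []))
    while queue:
        cur = queue.popleft()
        if cur in seen:
            continue
        seen.add(cur)
        found.update(vulns.get(cur, []))
        queue.extend(deps.get(cur, []))
    return sorted(found)
```

The database has to perform this kind of multi-hop walk on every query, which is why traversal performance on large graphs is a first-class requirement.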
- Schema: Data ingestion and retrieval do not depend only on the database you choose; they also depend on the schema in which the data is stored. A complex schema with unnecessarily fragmented data can result in poor ingestion and query performance. Simplifying the schema greatly improves performance by reducing the number of reads and writes that need to take place. In GraphQL, the schema for a dependency between two packages looks like the following:
"DependencyType determines the type of the dependency."
enum DependencyType {
"direct dependency"
DIRECT
"indirect dependency"
INDIRECT
"type not known/not specified"
UNKNOWN
}
"IsDependency is an attestation to record that a package depends on another."
type IsDependency {
id: ID!
"Package that has the dependency"
package: Package!
"Package for the dependency; MUST BE PackageName, not PackageVersion"
dependentPackage: Package!
"Version range for the dependency link"
versionRange: String!
"Type of dependency"
dependencyType: DependencyType!
"Justification for the attested relationship"
justification: String!
"Document from which this attestation is generated from"
origin: String!
"GUAC collector for the document"
collector: String!
}
- The above schema contains the necessary information to create the dependency relationship. We can map this schema, or a representation of it, onto the graph database to store and retrieve the data needed to recreate the “IsDependency” node.
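To show how such a schema maps onto concrete records, here is a minimal Python sketch mirroring the fields above (plain strings stand in for the Package type, and the field values are invented for illustration):

```python
from dataclasses import dataclass
from enum import Enum

class DependencyType(Enum):
    DIRECT = "DIRECT"
    INDIRECT = "INDIRECT"
    UNKNOWN = "UNKNOWN"

@dataclass
class IsDependency:
    package: str             # subject package (stand-in for the Package type)
    dependent_package: str   # the dependency; a package name, not a version
    version_range: str       # version range for the dependency link
    dependency_type: DependencyType
    justification: str       # justification for the attested relationship
    origin: str              # document this attestation was generated from
    collector: str           # GUAC collector for the document

# Hypothetical example record:
dep = IsDependency(
    package="pkg:golang/github.com/guacsec/guac",
    dependent_package="pkg:golang/github.com/spf13/cobra",
    version_range=">=1.0.0",
    dependency_type=DependencyType.DIRECT,
    justification="listed in go.mod",
    origin="spdx-sbom.json",
    collector="FileCollector",
)
```

Keeping the record this flat is exactly the kind of schema simplification that reduces the number of reads and writes per ingestion.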
- Optimizing the query language: Many graph databases (such as ArangoDB or Neo4j) come with their own query language for ingesting and reading data. Optimizing a query for the most efficient reads and writes can be challenging and may require evaluation/inspection to determine where the costly operations occur. For example, a query to ingest dependencies may look like the following in the Arango Query Language (AQL), which is quite complex:
LET firstPkg = FIRST(
  FOR pkg IN Pkg
    FILTER pkg.root == 'pkg'
    FOR pkgHasType IN OUTBOUND pkg PkgHasType
      FILTER pkgHasType.type == doc.pkgType && pkgHasType._parent == pkg._id
      FOR pkgHasNamespace IN OUTBOUND pkgHasType PkgHasNamespace
        FILTER pkgHasNamespace.namespace == doc.namespace && pkgHasNamespace._parent == pkgHasType._id
        FOR pkgHasName IN OUTBOUND pkgHasNamespace PkgHasName
          FILTER pkgHasName.name == doc.name && pkgHasName._parent == pkgHasNamespace._id
          FOR pkgHasVersion IN OUTBOUND pkgHasName PkgHasVersion
            FILTER pkgHasVersion.version == doc.version && pkgHasVersion.subpath == doc.subpath &&
                   pkgHasVersion.qualifier_list == doc.qualifier && pkgHasVersion._parent == pkgHasName._id
            RETURN {
              'type': pkgHasType.type,
              'namespace': pkgHasNamespace.namespace,
              'name': pkgHasName.name,
              'version': pkgHasVersion.version,
              'subpath': pkgHasVersion.subpath,
              'qualifier_list': pkgHasVersion.qualifier_list,
              'versionDoc': pkgHasVersion
            }
)
LET secondPkg = FIRST(
  FOR pkg IN Pkg
    FILTER pkg.root == 'pkg'
    FOR pkgHasType IN OUTBOUND pkg PkgHasType
      FILTER pkgHasType.type == doc.secondPkgType && pkgHasType._parent == pkg._id
      FOR pkgHasNamespace IN OUTBOUND pkgHasType PkgHasNamespace
        FILTER pkgHasNamespace.namespace == doc.secondNamespace && pkgHasNamespace._parent == pkgHasType._id
        FOR pkgHasName IN OUTBOUND pkgHasNamespace PkgHasName
          FILTER pkgHasName.name == doc.secondName && pkgHasName._parent == pkgHasNamespace._id
          RETURN {
            'type': pkgHasType.type,
            'namespace': pkgHasNamespace.namespace,
            'name': pkgHasName.name,
            'nameDoc': pkgHasName
          }
)
LET isDependency = FIRST(
  UPSERT { packageID:firstPkg.versionDoc._id, depPackageID:secondPkg.nameDoc._id, versionRange:doc.versionRange,
           dependencyType:doc.dependencyType, justification:doc.justification, collector:doc.collector, origin:doc.origin }
  INSERT { packageID:firstPkg.versionDoc._id, depPackageID:secondPkg.nameDoc._id, versionRange:doc.versionRange,
           dependencyType:doc.dependencyType, justification:doc.justification, collector:doc.collector, origin:doc.origin }
  UPDATE {} IN isDependencies
  RETURN NEW
)
LET edgeCollection = (
  FOR edgeData IN [
    {fromKey: isDependency._key, toKey: secondPkg.nameDoc._key, from: isDependency._id, to: secondPkg.nameDoc._id,
     label: 'dependency'},
    {fromKey: firstPkg.versionDoc._key, toKey: isDependency._key, from: firstPkg.versionDoc._id, to: isDependency._id,
     label: 'subject'}]
  INSERT { _key: CONCAT('isDependencyEdges', edgeData.fromKey, edgeData.toKey), _from: edgeData.from,
           _to: edgeData.to, label: edgeData.label } INTO isDependencyEdges OPTIONS { overwriteMode: 'ignore' }
)
RETURN {
  'firstPkgType': firstPkg.type,
  'firstPkgNamespace': firstPkg.namespace,
  'firstPkgName': firstPkg.name,
  'firstPkgVersion': firstPkg.version,
  'firstPkgSubpath': firstPkg.subpath,
  'firstPkgQualifier_list': firstPkg.qualifier_list,
  'secondPkgType': secondPkg.type,
  'secondPkgNamespace': secondPkg.namespace,
  'secondPkgName': secondPkg.name,
  'versionRange': isDependency.versionRange,
  'dependencyType': isDependency.dependencyType,
  'justification': isDependency.justification,
  'collector': isDependency.collector,
  'origin': isDependency.origin
}
- While in Neo4j (using Cypher), it would be:
MATCH (root:Pkg)-[:PkgHasType]->(type:PkgType)-[:PkgHasNamespace]->(namespace:PkgNamespace)
      -[:PkgHasName]->(name:PkgName)-[:PkgHasVersion]->(version:PkgVersion),
      (objPkgRoot:Pkg)-[:PkgHasType]->(objPkgType:PkgType)
      -[:PkgHasNamespace]->(objPkgNamespace:PkgNamespace)-[:PkgHasName]->(objPkgName:PkgName)
WHERE type.type = $pkgType AND namespace.namespace = $pkgNamespace AND name.name = $pkgName
  AND version.version = $pkgVersion AND version.subpath = $pkgSubpath
  AND version.qualifier_list = $pkgQualifierList AND objPkgType.type = $objPkgType
  AND objPkgNamespace.namespace = $objPkgNamespace AND objPkgName.name = $objPkgName
MERGE (version)<-[:subject]-(isDependency:IsDependency {versionRange: $versionRange,
      dependencyType: $dependencyType, justification: $justification,
      origin: $origin, collector: $collector})-[:dependency]->(objPkgName)
RETURN type.type, namespace.namespace, name.name, version.version, version.subpath,
       version.qualifier_list, isDependency, objPkgType.type,
       objPkgNamespace.namespace, objPkgName.name
- The two queries differ, but they achieve the same operation: ingestion of the “IsDependency” node. In each language, the query must traverse efficiently, use proper indexing, and limit the number of nodes that have to be filtered. We also need to remove unnecessary “UPSERT” (AQL) or “MERGE” (Cypher) operations, which either insert or update depending on whether the node already exists and can therefore be costly. These queries have to be carefully crafted for each language, using its own unique syntax.
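Much of this tuning comes down to the difference between a full scan and an indexed lookup. A few lines of Python can sketch the idea (an in-memory stand-in for a database index, not how any of these databases are actually implemented):

```python
# Sketch: why proper indexing matters. A linear FILTER over every node is
# O(n); a hash index on the filtered attribute makes the lookup O(1).
nodes = [{"name": f"pkg-{i}", "type": "golang"} for i in range(10_000)]

def lookup_scan(name):
    """Unindexed: examine every node, like a FILTER without an index."""
    return next((n for n in nodes if n["name"] == name), None)

# Build a hash index once, like a persistent index on the "name" attribute.
name_index = {n["name"]: n for n in nodes}

def lookup_indexed(name):
    """Indexed: a single hash lookup instead of a full collection scan."""
    return name_index.get(name)
```

With deeply nested traversals like the AQL query above, an unindexed filter at each level multiplies this cost, which is why index inspection is a core part of query optimization.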
Quantitative Analysis
As we work toward providing a persistent database for GUAC, the above characteristics need to be measured and quantified. Currently, we are evaluating ArangoDB, Neo4j, and Cloud Spanner. The GUAC community is also evaluating Amazon Neptune and ent (an ORM framework that supports multiple backends, such as SQLite, MySQL, and others). Thanks to its GraphQL interface, GUAC is not limited to a single database; we can support several, depending on the use case, but maintaining many backends can be troublesome. Therefore, we must do our due diligence to determine which databases should be incorporated as the persistent backend for GUAC. Stay tuned for our next blog, where we go into further detail about evaluating these characteristics in the databases mentioned above, with specific details and real-world metrics around ingestion and query execution times.
To learn more about GUAC, visit the official site at https://guac.sh, the docs at https://docs.guac.sh, or the source code repository at https://github.com/guacsec/guac (Give us a star while you’re there!).
Like what you read? Share it with others.