Graph Datasets
xTorch provides support for graph-based machine learning tasks with a collection of standard graph datasets. These are essential for developing and benchmarking Graph Neural Networks (GNNs).
Graph datasets are located under the xt::datasets namespace and can be found in the <xtorch/datasets/graph_data/> header directory.
Graph Data Representation
Unlike image or text data, which is typically represented as a pair of (data, target) tensors, graph data has a more complex structure. In xTorch, a graph dataset typically returns a torch::data::Example containing multiple components:
x: A[num_nodes, num_node_features]tensor of node features.edge_index: A[2, num_edges]tensor representing the graph's connectivity in COO (coordinate) format. Each column is an edge.y: A tensor of node or graph labels, depending on the task.
The DataLoader for graph data is designed to handle this structure and create mini-batches appropriately.
General Usage
The workflow for using a graph dataset involves instantiating the dataset class and passing it to a data loader. Due to the nature of graph data, complex transformations are less common but still possible.
#include <xtorch/xtorch.h>
int main() {
// 1. Instantiate a dataset for the Cora citation network
// This dataset is commonly used for node classification.
auto dataset = xt::datasets::Cora(
"./data",
/*download=*/true
);
// Note: Graph datasets often represent a single large graph.
// The "size" might be 1, and batching is handled differently by specialized GNN data loaders.
std::cout << "Cora dataset loaded." << std::endl;
// For demonstration, let's get the single graph object from the dataset
auto graph_data = dataset.get(0);
auto node_features = graph_data.data;
auto edge_index = graph_data.target; // Example structure, might differ per dataset
std::cout << "Node feature shape: " << node_features.sizes() << std::endl;
std::cout << "Edge index shape: " << edge_index.sizes() << std::endl;
// 2. Pass the dataset to a DataLoader
// For GNNs, you might use a specialized graph data loader or a standard one with a batch size of 1
// if you are doing full-graph training.
xt::dataloaders::ExtendedDataLoader data_loader(dataset, /*batch_size=*/1, /*shuffle=*/false);
// The data loader is now ready for use in a training loop
for (auto& batch : data_loader) {
// ... training step with a GNN model ...
}
}!!! warning "Graph Batching"
Batching multiple graphs into a single larger graph (a common technique in GNNs) is a specialized process. While the ExtendedDataLoader can iterate over datasets, you may need custom collation logic for advanced GNN training scenarios. For full-graph training (where the entire graph is processed at once), a batch size of 1 is appropriate.
Available Datasets by Task
Node Classification
Node classification is the task of predicting a label for each node in a graph, given the labels of some nodes.
| Dataset Class | Description | Header File |
|---|---|---|
Cora |
A citation network dataset where nodes are documents and edges are citation links. The task is to classify each document into one of seven classes. | node_classification/cora.h |
Graph-Level Tasks (Graph Classification/Regression)
Graph-level tasks involve predicting a single property for an entire graph.
| Dataset Class | Description | Header File |
|---|---|---|
OGBMolHIV |
A molecular property prediction dataset from the Open Graph Benchmark. The task is to predict whether a molecule inhibits HIV virus replication. | molecular_property_prediction/ogb_mo_ihiv.h |
Knowledge Graph Reasoning
| Dataset Class | Description | Header File |
|---|---|---|
Freebase |
A subset of the Freebase knowledge graph used for link prediction tasks. | knowledge_graph_reasoning/freebase.h |
Wikidata5M |
A large-scale knowledge graph distilled from Wikidata and Wikipedia. | knowledge_graph_reasoning/wikidata_5m.h |
