Compute Canada and the Canadian Association of Research Libraries (CARL) have entered into an agreement to create a production-scale system for research data management in Canada. The system will leverage existing platforms and custom software to support data ingestion, curation, discovery, transfer and geo-replication. It is designed to be federated (bring your own storage), to scale to a national level, and is a unique model globally. In collaboration with Globus and the CARL Portage project, a team of Compute Canada developers has been working on this solution since January 2016, and demonstrations were held at this year’s High Performance Computing Symposium in Edmonton.
Research data management (RDM) is a key element of the national data infrastructure under development by Compute Canada and its partners. Key attributes of this federated RDM model include:
- Globus file transfer: A data transfer and management service suited to high-speed movement of data among national sites and elsewhere.
- Scalable model: The tools chosen for data transfer, replication and discovery are already used to move hundreds of petabytes of data around the world. While most institutional solutions have been designed to accommodate small files moving short distances, Compute Canada’s solution can scale to the distances and dataset sizes a national system requires.
- National data discovery: Different data collections can be hosted in different locations; with appropriate access controls and metadata, they will all be searchable from a single web-based tool.
- Data preservation pipeline: Many repository systems do not include long-term data preservation in their design. Compute Canada’s solution includes an archival tool suite capable of normalizing data into long-term storage formats and packaging it into Archival Information Packages. While this is a standalone tool suite, we have successfully tested it as part of a processing chain.
- Suitable for a broad range of data types: Compute Canada’s solution can manage a wide variety of data types from many disciplines.
- Automated geographic data replication: The storage solution takes advantage of Compute Canada’s new national data infrastructure to host geographically dispersed replicas and backups of datasets.
- Bulk data and metadata ingestion: Canadian researchers manage large quantities of research data. Our solution includes software scripts for batch ingestion of large volumes of data and metadata.
- Access control mechanisms: Fine-grained control over which users, groups and organizations can access or view sensitive data.
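To make the geo-replication attribute above concrete, here is a minimal sketch of a replica-placement policy; the site names, region map, and policy are hypothetical illustrations, not the actual Compute Canada implementation:

```python
# Hypothetical replica-placement policy: choose replica sites for a
# dataset such that no two replicas (including the primary copy)
# share a geographic region. Site and region names are illustrative.

SITES = {
    "site-west": "west",
    "site-prairies": "prairies",
    "site-east": "east",
}

def place_replicas(primary_site, n_replicas=2):
    """Return replica sites whose regions differ from the primary's
    and from each other's."""
    used_regions = {SITES[primary_site]}
    replicas = []
    for site, region in sorted(SITES.items()):
        if len(replicas) == n_replicas:
            break
        if region not in used_regions:
            replicas.append(site)
            used_regions.add(region)
    return replicas
```

A real deployment would also weigh site capacity and network topology; this sketch only conveys the dispersal constraint.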
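The batch-ingestion attribute can be illustrated with a small sketch that pairs each file with its metadata and a checksum before submission; the function name and manifest format are assumptions for illustration only:

```python
import hashlib
import json

def build_ingest_manifest(records):
    """Build a JSON manifest for batch ingestion. `records` is a list
    of (path, file_bytes, metadata_dict) tuples; each entry gets a
    SHA-256 checksum so integrity can be verified after transfer."""
    manifest = []
    for path, data, metadata in records:
        manifest.append({
            "path": path,
            "sha256": hashlib.sha256(data).hexdigest(),
            "metadata": metadata,
        })
    return json.dumps(manifest, indent=2, sort_keys=True)
```

A driver script could walk a directory tree, build such a manifest in chunks, and submit each chunk to the repository's ingest endpoint.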
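The access-control attribute can likewise be sketched as a simple group-aware permission check; the ACL structure below is an assumed shape for illustration, not the system's actual model:

```python
def can_access(user, dataset_acl, memberships):
    """Grant access if the user is listed directly, or belongs to any
    group or organization named in the dataset's ACL.
    `dataset_acl` maps "users"/"groups" to sets of names (assumed
    shape); `memberships` maps each user to their set of groups."""
    if user in dataset_acl.get("users", set()):
        return True
    user_groups = memberships.get(user, set())
    return bool(user_groups & dataset_acl.get("groups", set()))
```

Checks like this would run on every read or discovery request, so sensitive collections stay visible only to the users, groups and organizations they are shared with.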