Supercomputer Mass Tape Migration

Tape Ark specializes in migrating large volumes of data from active supercomputers and high-performance computing environments, and has performed the largest migration of tape-based data ever completed.

Due to rapid advances in processing technology, quantum computing, and high-performance computing (HPC), supercomputer facilities are looking to migrate their data to new platforms. However, these facilities tend to be anchored to data created on older systems, often held in massive multi-petabyte or even exabyte-scale collections wrapped in proprietary formats.

Being tethered to these legacy tape collections can slow or even prevent companies from taking advantage of the latest technology. At Tape Ark, we believe some of the most important discoveries of our time will be made by combining historical content with new technology to derive profound results. Those results may take the form of higher-resolution output, deeper insights, or better access to artificial intelligence (AI) and machine learning (ML) at new scales.

Tape Ark has the experience to move multi-hundred-petabyte or exabyte-scale data collections between old and new environments – even when those environments are still live, with the data held in active tape libraries and robotic silos.

Tape Ark can migrate between a wide variety of input and output media and tape formats. Whether it is tape to tape, tape to disk, tape to cloud, or tape to a virtual tape library (VTL) on disk or in the cloud, we have you covered.

We have completed projects that required bespoke decoding of data formats, duplicate detection at massive scale, and file renaming and mapping measured in the billions of files. See the complete list of data formats, tape formats, and robotic tape libraries that we work with. Don’t see the format you are interested in? Reach out.
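To give a sense of the kind of processing involved, the sketch below shows content-hash duplicate detection in Python. It is a minimal illustration only, not Tape Ark’s production pipeline: the single-machine in-memory index and directory walk are assumptions, and a real job measured in billions of files would pre-filter on file size and shard the hash index across many machines.

```python
# Minimal sketch of content-hash duplicate detection on a restored file tree.
# Illustration only -- not Tape Ark's production pipeline. A billions-of-files
# job would pre-filter on size and shard this index across machines.
import hashlib
import os
import sys

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time so large files stream, not load


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: str):
    """Yield (duplicate_path, original_path) pairs found under root."""
    seen = {}  # digest -> first path seen with that content
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = sha256_of(path)
            if digest in seen:
                yield path, seen[digest]
            else:
                seen[digest] = path


if __name__ == "__main__":
    for dup, orig in find_duplicates(sys.argv[1]):
        print(f"DUPLICATE {dup} == {orig}")
```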

Supercomputing facilities handle a wide variety of data formats that support a range of scientific and engineering applications. Some of the common formats found in these facilities include:

Scientific Data Formats

  • NetCDF (Network Common Data Form): Commonly used for array-oriented scientific data, especially in climate and weather models.
  • HDF5 (Hierarchical Data Format version 5): Used for storing large amounts of numerical data and supporting complex data models, common in astronomy, physics, and environmental science.
  • CDF (Common Data Format): Used in space science and related fields.
  • FITS (Flexible Image Transport System): Standard format in astronomy for storing, transmitting, and analyzing scientific datasets, especially images.

Text and Binary Data Formats

  • CSV (Comma-Separated Values): Simple text format for tabular data.
  • JSON (JavaScript Object Notation): Text format for structured data interchange, commonly used in web applications and data exchange.
  • XML (eXtensible Markup Language): Flexible text format for structured data, widely used for data interchange.
  • Binary Files: Custom binary formats tailored to specific applications for performance reasons.

Simulation and Modeling Data Formats

  • PDB (Protein Data Bank): Format used for 3D data of large molecules, primarily proteins and nucleic acids.
  • XYZ Files: Simple text file format used to represent atomic coordinates in computational chemistry.
  • VTK (Visualization Toolkit): Format for storing 3D data structures, used in computational fluid dynamics and other simulations.
  • LAMMPS Dump Files: Format used by the LAMMPS molecular dynamics simulator.

Image and Volume Data Formats

  • TIFF (Tagged Image File Format): Used for storing raster graphics images, popular in medical imaging and remote sensing.
  • NIfTI (Neuroimaging Informatics Technology Initiative): Standard format for neuroimaging data.
  • DICOM (Digital Imaging and Communications in Medicine): Standard for storing and transmitting medical imaging information.

Specialized Data Formats

  • ROOT: Data analysis framework and file format used by the high-energy physics community.
  • SPSS/SAS/Stata Files: Proprietary formats used in statistical analysis.
  • Parquet: Columnar storage format used in data processing frameworks like Apache Hadoop and Apache Spark.

Parallel File System Formats

  • GPFS (General Parallel File System): High-performance clustered file system.
  • Lustre: Parallel distributed file system, commonly used in large-scale cluster computing.

These formats help supercomputing facilities manage the complexity, scale, and performance requirements of their workloads.
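As one concrete example, a quick post-restore readability check on an HDF5 file might look like the Python sketch below. The third-party h5py package and the filename restored_from_tape.h5 are assumptions for illustration; a real validation pass would also verify checksums against a source manifest.

```python
# Minimal sketch: confirm a restored HDF5 file opens and list its datasets.
# Assumes the third-party h5py package (pip install h5py); the filename
# "restored_from_tape.h5" is hypothetical.
import h5py


def describe(name, obj):
    """Print the path, shape, and dtype of each dataset h5py visits."""
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")


with h5py.File("restored_from_tape.h5", "r") as f:
    f.visititems(describe)  # walk the whole group hierarchy
```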

Benefits

  • Secure data restore and reformat
  • Fast, high-volume ingest
  • Enhanced data utility
  • Streamlined compliance
  • Evergreen data preservation
  • Modernize legacy systems
  • Cloud agnostic

Solutions

Tape Ark prides itself on its technical capability. Customers can feel assured that their objectives and unique challenges are well understood – what’s important to them, where they’re headed, and the opportunities that lie ahead.

The Maverick Mindset
Knowledge is power – but only if that knowledge is accessible. To move forward we must look back, understand the data we have, and start thinking about how to utilize it – that is a big-data frame of mind.

A Welcome to Country

Tape Ark was founded in Australia and has proudly grown from an Australian-based company into an international success.

Tape Ark acknowledges the Traditional Custodians of country throughout Australia and their connections to land, sea and community. We pay our respect to their elders past and present and extend that respect to all Aboriginal and Torres Strait Islander peoples today.

Need help getting started?
Contact Tape Ark
