Gating-ML 2.0 support in FlowRepository

# Summary

Gating-ML 2.0 is a standard developed by the International Society for Advancement of Cytometry (ISAC) for computer-interchangeable and unambiguous XML-based gate definitions. The specification supports rectangular gates in n dimensions (i.e., from one-dimensional range gates up to n-dimensional hyper-rectangular regions), quadrant gates in n dimensions, polygon gates, ellipsoid gates in n dimensions, and Boolean collections of any of these gate types. Gates can be uniquely identified and may be ordered into a hierarchical structure to describe a gating strategy. Transformations, including compensation descriptions, are included as part of the Gating-ML specification. FlowRepository supports Gating-ML 2.0 import and export, which may be used for the interchange of gating with other software tools supporting the Gating-ML 2.0 specification.

There are certain differences between what can be described by Gating-ML version 2.0 and what is internally supported by FlowRepository. These differences include supported gate types, scale definitions, and other details.

All gates and population definitions saved in FlowRepository can be exported to Gating-ML; however, this export involves certain conversions that are necessary to adapt to the gate and transformation types supported by Gating-ML. Details about these are described in the Gating-ML export section. In addition, FlowRepository’s internal gating information includes a mapping of tailored gates to the FCS files that the gates are applicable to. This sort of information is not formally describable in Gating-ML (ISAC is developing a separate standard for the description of such relations). Consequently, this mapping is not formally included in the exported Gating-ML. However, we are using certain proprietary mechanisms in order to be able to reconstruct some of this mapping if the Gating-ML file has been exported from FlowRepository.

Gating-ML 2.0 documents may include certain gates or data transformations that are impossible to represent internally in FlowRepository. As detailed in the Gating-ML import section, we are trying to approximate these with internally supported gate and transformation types where possible.

# Gating-ML export

## Workflow

Gating-ML 2.0 expects the following steps to be followed when gating:

- Read and linearize data from a list mode data file (i.e., read data from an FCS file and perform the “channel to scale” conversion based on FCS keywords, such as $PnE, $PnG etc.).
- Compensate the data (if required).
- Perform a “visualization” transformation (e.g., ASinH, Logicle, etc.).
- Apply the gates as “data filters” to produce sub-populations of events.

Consequently, gates in Gating-ML 2.0 are described in the “visualization space”. In FlowRepository, there is an additional step of “binning” performed after the visualization transformation. Consequently, gates in FlowRepository are described in the “bin space”. Therefore, for the Gating-ML 2.0 export, the gate coordinates have to be “unbinned” in order to bring them to the visualization space. In addition, minor adjustments due to a different scale transformation may need to be performed (see Scale transformations below).
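The unbinning step can be illustrated with a minimal sketch. It assumes a linear binning of the visualization range into equal-width bins; the text above does not specify FlowRepository's actual binning scheme, so the function names and the `(vmin, vmax, n_bins)` axis description are hypothetical.

```python
def bin_value(v, vmin, vmax, n_bins):
    """Map a visualization-space value to a (fractional) bin coordinate.

    Assumption: bins have equal width over [vmin, vmax].
    """
    return (v - vmin) / (vmax - vmin) * n_bins


def unbin_value(b, vmin, vmax, n_bins):
    """Inverse of bin_value: bring a bin-space coordinate back to visualization space."""
    return vmin + b / n_bins * (vmax - vmin)


def unbin_gate(vertices, x_axis, y_axis):
    """Unbin every (x, y) vertex of a gate; each axis is a (vmin, vmax, n_bins) tuple."""
    return [(unbin_value(x, *x_axis), unbin_value(y, *y_axis)) for x, y in vertices]
```

Under this assumption, `unbin_gate([(0, 256)], (0.0, 8.0, 256), (0.0, 4.0, 256))` moves a vertex from bin coordinates back to the visualization-space point `(0.0, 4.0)`.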

## Compensation

Gating-ML 2.0 supports the following choices for compensation description:

- Use uncompensated data.
- Compensate data as prescribed by the data file (e.g., $SPILLOVER keyword in the FCS file).
- Provide a custom compensation description within the Gating-ML 2.0 file.

These choices correspond to compensation options used in FlowRepository internally. However, there is a difference in how custom compensation is described. FlowRepository uses a typical spillover matrix. This is a square matrix with rows and columns corresponding to FCS channels for which the compensation description is applicable. The compensation of an event is then performed by multiplying the event vector by the inverse of the spillover matrix (assuming the channels in the event vector correspond exactly to the channels in the spillover matrix).
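As an illustration of the spillover-based compensation described above, the following sketch multiplies an event (as a row vector) by the inverse of a spillover matrix. It is a plain-Python example, not FlowRepository's implementation; it assumes that row i of the matrix describes how dye i spills into each detector, and the function names are hypothetical.

```python
def invert(matrix):
    """Invert a square matrix with Gauss-Jordan elimination and partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


def compensate(event, spillover):
    """Return the event (row vector) multiplied by the inverse of the spillover matrix."""
    inv = invert(spillover)
    n = len(event)
    return [sum(event[i] * inv[i][j] for i in range(n)) for j in range(n)]
```

For example, with the spillover matrix `[[1.0, 0.2], [0.1, 1.0]]`, true dye signals of 100 and 50 are measured as 105 and 70, and `compensate([105.0, 70.0], ...)` recovers approximately 100 and 50.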

Gating-ML 2.0 uses spectrum matrices. A spectrum matrix represents a generalization of a spillover matrix where the number of detectors may exceed the number of fluorochromes (dyes) used in the experiment. This has been introduced by ISAC in order to respond to the introduction of certain new instruments on the market (produced mainly by Sony). Even in this case, we assume that the detector measurements are a linear combination of the amounts of the various dyes present, plus some errors or noise in the system. Therefore, we seek a linear combination of the measured values that is an estimator of the amount of each dye. For n dyes (fluorochromes) and m detectors, we have an n × m spectrum matrix and we seek an m × n unmixing matrix (e.g., the Moore–Penrose pseudoinverse). In order to compensate, we multiply each m-vector of measurement values by the unmixing matrix and get an n-vector of dye estimates.
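The unmixing described above can be sketched for the simplest non-square case. The example below assumes exactly two dyes (so that a closed-form 2 × 2 inverse suffices) and linearly independent spectra, in which case the Moore–Penrose pseudoinverse of the spectrum matrix S equals Sᵀ(SSᵀ)⁻¹. All function names are illustrative.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("spectra are linearly dependent")
    return [[d / det, -b / det], [-c / det, a / det]]


def unmixing_matrix(spectrum):
    """m x 2 pseudoinverse S^T (S S^T)^-1 of a 2 x m spectrum matrix."""
    st = transpose(spectrum)
    return matmul(st, inverse_2x2(matmul(spectrum, st)))


def unmix(measurement, spectrum):
    """Estimate the 2 dye amounts from an m-vector of detector measurements."""
    return matmul([measurement], unmixing_matrix(spectrum))[0]
```

With the spectrum matrix `[[1.0, 0.5, 0.1], [0.0, 0.4, 1.0]]` (two dyes, three detectors), noise-free measurements `[200.0, 132.0, 100.0]` unmix to approximately 200 and 80. With noisy data the result is a least-squares estimate rather than an exact recovery.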

Each spillover matrix can be described as a Gating-ML 2.0 spectrum matrix and therefore, custom compensations in FlowRepository are exported as spectrum matrices in Gating-ML 2.0. The values in these two matrices are the same. However, there is a difference in how rows and columns are identified. With spillover matrices, the row and column names correspond to each other. Spectrum matrices use two distinct sets of names for rows (fluorochromes) and columns (detectors). In order to save a spillover matrix as a spectrum matrix in Gating-ML 2.0, we create a new set of channel names for the compensated channels by adding a “Comp_” prefix. Channels compensated according to this matrix are then referenced by their new names in the Gating-ML file. If a FlowRepository gate references a custom compensation, but a particular channel is not included in that compensation description (e.g., the FSC-A channel of an FSC-A / PE-A gate), then that channel is specified as uncompensated in the exported Gating-ML file.
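The naming convention can be sketched as follows. The dictionary layout is a hypothetical in-memory representation chosen for this example (it is not the Gating-ML XML structure), and the sketch assumes that the prefixed names are used for the rows (fluorochromes) while the detectors keep the original channel names.

```python
COMP_PREFIX = "Comp_"


def spillover_to_spectrum(channels, spillover):
    """Describe a square spillover matrix as a spectrum matrix.

    The values are copied unchanged; only the row names receive the prefix.
    """
    return {
        "fluorochromes": [COMP_PREFIX + name for name in channels],  # row names
        "detectors": list(channels),                                  # column names
        "values": [list(row) for row in spillover],
    }


def gate_dimension_name(channel, compensated_channels):
    """Name under which a gate references a channel in the exported file."""
    if channel in compensated_channels:
        return COMP_PREFIX + channel
    return channel  # not covered by the compensation: exported as uncompensated
```

For a compensation covering FITC-A and PE-A, a gate on FSC-A / PE-A would reference `FSC-A` (uncompensated) and `Comp_PE-A`.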

## Scale transformations

FlowRepository supports the following scale (visualization) transformations:

- Linear scale: f(x) = x.
- Log scale: f(x) = log(x); where log is a decadic logarithm.
- Arcsinh scale: f(x) = asinh(x / c); where asinh is the arcus hyperbolic sine function, and where c is a “compression width”.

There are corresponding visualization transformations supported by Gating-ML 2.0; however, their parameterization is different. In general, Gating-ML 2.0 transformations use the following parameters (not all of them are always applicable):

- T: the top of scale value (i.e., the maximum value expected after compensation).
- M: the number of “decades”.
- W: the number of “decades” in the approximately linear region.
- A: the number of “additional negative decades”.

Using these parameters, transformations are designed so that

- “Reasonable” values are mapped to the [0,1] interval. The meaning of the term “reasonable” depends on the actual scale transformation. Typically, it means less than or equal to T, but additional restrictions may apply (e.g., greater than zero for logarithmic transformations).
- If a software application does not support a particular scale type and has to use a different visualization transformation instead, then the differences in the resulting populations should be relatively small provided the same values for T, M, W, and A are used (as applicable).

All FlowRepository’s visualization transformations are describable in Gating-ML and the conversion is performed as “unbinning” followed by:

- Linear scale, f(x) = x: No visualization transformation is referenced in the Gating-ML file, which means that no visualization transformation shall be performed. This corresponds exactly to FlowRepository’s linear scale.
- Log scale, f(x) = log(x): Gating-ML’s “flog” transformation is used. This transformation is defined as flog(x, M, T) = (1 / M) * log(x / T) + 1. We set M = 1 and T = 1, which results in the transformation flog(x, 1, 1) = log(x) + 1. In addition, when exporting a gate, we increment appropriate unbinned coordinate values by 1, which will then define exactly the same populations in the exported Gating-ML as they are defined internally in FlowRepository.
- Arcsinh scale, f(x) = asinh(x / c): Gating-ML’s “fasinh” transformation is used. This transformation is defined as

  fasinh(x, M, T, A) = (asinh(x * sinh(M * ln(10)) / T) + A * ln(10)) / ((M + A) * ln(10)).

  We set M = 1 / ln(10), T = c * sinh(1), and A = 0, which gives us

  fasinh(x, 1 / ln(10), c * sinh(1), 0) = asinh(x * sinh(1) / T) / (M * ln(10)) = asinh(x / c).

  This corresponds exactly to FlowRepository’s Arcsinh scale.

By defining the visualization transformations and saving the gates as described above, the exported Gating-ML file defines the same populations as defined internally in FlowRepository. However, please note that in order to match the transformations exactly, we are required to choose “unusual” values for the T, M and A parameters. Consequently, the results of the transformations do not fall into the [0,1] interval.
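The parameter choices above can be checked numerically. The sketch below implements the flog and fasinh formulas exactly as stated in this section and compares them with FlowRepository's log and Arcsinh scales; the Python function names are illustrative only.

```python
import math

LN10 = math.log(10)


def flog(x, M, T):
    """Gating-ML flog: (1 / M) * log10(x / T) + 1."""
    return (1.0 / M) * math.log10(x / T) + 1.0


def fasinh(x, M, T, A):
    """Gating-ML fasinh as defined in the text above."""
    return (math.asinh(x * math.sinh(M * LN10) / T) + A * LN10) / ((M + A) * LN10)


def flowrepository_log(x):
    return math.log10(x)


def flowrepository_arcsinh(x, c):
    return math.asinh(x / c)


def export_log(x):
    """flog with M = 1 and T = 1, i.e. log10(x) + 1."""
    return flog(x, M=1.0, T=1.0)


def export_arcsinh(x, c):
    """fasinh with M = 1 / ln(10), T = c * sinh(1), A = 0, i.e. asinh(x / c)."""
    return fasinh(x, M=1.0 / LN10, T=c * math.sinh(1.0), A=0.0)
```

Up to floating-point rounding, `export_arcsinh(x, c)` equals `flowrepository_arcsinh(x, c)`, and `export_log(x)` equals `flowrepository_log(x) + 1`, which is why gate coordinates on the log scale are incremented by 1. A value such as `export_arcsinh(10000.0, 150.0)` is larger than 1, illustrating that the results are not confined to the [0,1] interval.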

## Gates

### Gate dimensions

In Gating-ML, gating dimensions are referenced by short channel names (e.g., $PnN values in FCS files). In FlowRepository, gating dimensions are referenced by channel indexes. In addition, a gate in FlowRepository may reference an FCS file that was used to create the gate (although the gate may still be applicable to many FCS files).

If there is an FCS file associated with the gate that is being exported, then the channels of this FCS file are used to convert channel indexes (used in FlowRepository’s gate definition) to short channel names (used in Gating-ML). If no FCS file is associated with the gate, then all FCS files associated with the particular experiment are considered in order to create all possible channel name combinations that match the channel indexes of that gate. The gate is then exported multiple times; once for each of the channel combinations.
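The second case (no FCS file associated with the gate) can be sketched as follows. The sketch assumes 0-based channel indexes and a hypothetical mapping from FCS file name to its ordered list of short channel names; files that do not contain all referenced indexes are skipped.

```python
def channel_name_combinations(gate_indexes, fcs_files):
    """Distinct channel-name tuples that the gate's channel indexes resolve to.

    fcs_files maps a file name to its ordered list of short channel names ($PnN).
    """
    combos = []
    for names in fcs_files.values():
        if all(0 <= i < len(names) for i in gate_indexes):
            combo = tuple(names[i] for i in gate_indexes)
            if combo not in combos:  # export each combination only once
                combos.append(combo)
    return combos
```

If two files name the third channel `FITC-A` and a third file names it `GFP-A`, a gate on indexes 0 and 2 yields two combinations and would be exported twice.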

### Gate identifiers and names

Gates in Gating-ML are identified by unique identifiers (xsd:ID). An xsd:ID must start with a letter or underscore, and can only contain letters, digits, underscores, hyphens, and periods. In Gating-ML, there is no standardized way to store a name of a gate (but it can be saved in the custom metadata section, see below). Therefore, we choose to create gate identifiers as the concatenation of “Gate_”, gate ID, "_", encoded gate channels, “_” and an encoded gate name, where gate ID is FlowRepository’s internal (numeric) gate identifier and the encoded gate name is created by encoding FlowRepository’s name of the gate in Base64 and replacing ‘=’, ‘+’, and ‘/’ with ‘.’, ‘_’ and ‘-’, respectively. Encoded channels are created by encoding the short channel name(s) the same way. This creates a safe and valid XML identifier.
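A minimal sketch of this identifier scheme, using Python's standard `base64` module, is shown below. How several channel names are combined before encoding is not stated above; joining them with a comma is an assumption made only for this example, and the function names are hypothetical.

```python
import base64


def encode_for_id(text):
    """Base64-encode text and replace '=', '+', '/' with '.', '_', '-'."""
    b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return b64.replace("=", ".").replace("+", "_").replace("/", "-")


def decode_from_id(encoded):
    """Reverse encode_for_id for a single encoded component."""
    b64 = encoded.replace(".", "=").replace("_", "+").replace("-", "/")
    return base64.b64decode(b64).decode("utf-8")


def gate_identifier(gate_id, channels, name):
    # Assumption: several channel names are joined with "," before encoding;
    # the text does not say how multiple channels are combined.
    encoded_channels = encode_for_id(",".join(channels))
    return "Gate_%d_%s_%s" % (gate_id, encoded_channels, encode_for_id(name))
```

For instance, `encode_for_id("CD4+")` gives `Q0Q0Kw..`, which contains only characters permitted in an xsd:ID, and `decode_from_id` recovers the original name.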

### Rectangle and range gates

Gating-ML supports n-dimensional “rectangular” gates, which are intended to encode one-dimensional range gates, two-dimensional rectangle gates, as well as multi-dimensional hyper-rectangular regions. We are utilizing these rectangular gates in order to export rectangle and range gates.

In addition, rectangular gates in Gating-ML may have an “open end”; we could be using this feature to export split range gates, where either the minimum (for the left split range gate), or the maximum (for the right split range gate) are unbounded. However, we decided to use “1 dimensional quad gates” instead (see below), since these include the notion of connecting the two split range gates together into a single split gate.

### Quadrant and split gates

Gating-ML supports n-dimensional quadrant gates, which we are using to encode orthogonal quadrant gates and split gates. In Gating-ML, the n-dimensional space is divided by a set of dividers (at least one divider per dimension). Quads are enclosed by these dividers and each Quad is defined by setting a representative n-dimensional point; the point’s position defines the quad’s position with respect to the surrounding dividers.

We are using one-dimensional quadrants to encode FlowRepository’s split gates in Gating-ML. The Gating-ML divider is set based on the split divider, and the two split ranges are defined by setting representative (1-dimensional) points to the value of the split divider +/- 1. Encoding FlowRepository’s split gates also covers the related split range gates.

We are using two dimensional quadrants to encode FlowRepository’s quadrant gates in Gating-ML. The two Gating-ML dividers (one for each dimension) are set based on the x and y position of the quadrant, and the four quad gates are defined by setting representative points to [x+1,y+1] (UR quad), [x-1,y+1] (UL quad), [x-1,y-1] (LL quad), and [x+1,y-1] (LR quad). Encoding FlowRepository’s quadrant gates also covers the related quads.
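The representative points for split and quadrant gates can be sketched as below. The helper `quad_containing` illustrates how a representative point selects a region relative to the dividers; treating a value equal to a divider as belonging to the upper side is an assumption of this sketch, and all names are hypothetical.

```python
def split_gate_points(divider):
    """Representative 1-D points for the left and right ranges of a split gate."""
    return {"left": (divider - 1,), "right": (divider + 1,)}


def quadrant_gate_points(x, y):
    """Representative 2-D points for the four quads around the position (x, y)."""
    return {
        "UR": (x + 1, y + 1),
        "UL": (x - 1, y + 1),
        "LL": (x - 1, y - 1),
        "LR": (x + 1, y - 1),
    }


def quad_containing(event, dividers, quads):
    """Name of the quad whose representative point lies on the same side
    of every divider as the event (one divider per dimension)."""
    for name, point in quads.items():
        if all((e >= d) == (p >= d) for e, p, d in zip(event, point, dividers)):
            return name
    return None
```

With dividers at (100, 200), an event at (150, 50) falls into the `LR` quad because it lies on the same side of both dividers as the representative point (101, 199).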

### Ellipse gates

Gating-ML supports n-dimensional ellipsoid gates defined by the center and a covariance matrix. The center point encodes the position of the ellipsoid and the covariance matrix encodes its size and shape. The advantage of the covariance approach becomes apparent with three or more dimensions. We are using two-dimensional ellipsoids to encode FlowRepository’s ellipse gates in Gating-ML. In two dimensions, there is a straightforward way to convert FlowRepository’s representation to the covariance matrix-based representation and back (the directions of the half axes correspond to the eigenvectors of the covariance matrix, and their lengths are determined by its eigenvalues).
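The conversion to the covariance representation can be sketched as follows. The sketch assumes the ellipse is given by its two half-axis lengths and a rotation angle in radians, and that the squared-distance threshold of the exported gate is 1; neither detail is stated above, so treat both as assumptions of this example.

```python
import math


def ellipse_to_covariance(a, b, angle):
    """Covariance matrix C = R * diag(a^2, b^2) * R^T of an ellipse with
    half axes a and b, rotated counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    cxx = a * a * c * c + b * b * s * s
    cyy = a * a * s * s + b * b * c * c
    cxy = (a * a - b * b) * s * c
    return [[cxx, cxy], [cxy, cyy]]


def inside_ellipse(point, center, cov, distance_square=1.0):
    """Test (p - mu)^T C^-1 (p - mu) <= D^2 for a 2 x 2 covariance matrix."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    (cxx, cxy), (_, cyy) = cov
    det = cxx * cyy - cxy * cxy
    d2 = (cyy * dx * dx - 2.0 * cxy * dx * dy + cxx * dy * dy) / det
    return d2 <= distance_square
```

An axis-aligned ellipse with half axes 3 and 1 yields the covariance matrix `[[9, 0], [0, 1]]`; a point at (2.9, 0) relative to the center is inside, while (0, 1.1) is outside.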

### Polygon gates

Polygon gates are encoded as a sequence of vertices in both FlowRepository and Gating-ML. Therefore, the conversion from FlowRepository’s internal representation is straightforward.

## Gate sets

Gating-ML does not distinguish between gates and population definitions. In Gating-ML, every gate defines a population, and there are Boolean gates (i.e., AND, OR and NOT), which can be used to create a gate as a combination of other gates. Consequently, a Boolean AND gate is used to export FlowRepository’s gate sets (population definitions) to Gating-ML.

In case a population is defined by a single gate, the population definition is essentially a duplication of an existing gate; however, we choose to still save it as an AND gate in order to be explicit about the fact that such a population has been defined. In addition, Gating-ML requires at least two arguments for the AND gate. Therefore, we are repeating the same gate twice as two arguments of the AND gate in case a population is defined by a single gate.
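The handling of single-gate populations can be sketched as below. The returned dictionary is a hypothetical in-memory description of the AND gate for this example, not the Gating-ML XML structure.

```python
def gate_set_to_and_gate(gate_ids):
    """Arguments of the Boolean AND gate that represents a gate set."""
    if not gate_ids:
        raise ValueError("a gate set needs at least one gate")
    arguments = list(gate_ids)
    if len(arguments) == 1:
        # An AND gate needs at least two arguments: repeat the single gate.
        arguments = arguments * 2
    return {"type": "AND", "arguments": arguments}
```

Since G AND G selects the same events as G, the repeated argument does not change the resulting population.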

Gating-ML does not include the notion of tailored gates. In fact, Gating-ML does not cover the information about which gate is applicable to which data file (ISAC is developing a separate standard for the description of this). If there are tailored gates referenced from FlowRepository’s gate set definition, then we are exporting several AND gates in Gating-ML, one for each of the FCS files that has any tailored gates in that gate set associated with that FCS file. The appropriate tailored gates are chosen as arguments of the AND gate.

The identifier of that AND gate is created as the concatenation of “GateSet_”, gate set ID, "_", encoded FCS file name (for tailored GateSets only), and encoded short channel names. Encoding means encoding in Base64 and replacing ‘=’, ‘+’, and ‘/’ with ‘.’, ‘_’ and ‘-’, respectively. This creates a safe and valid XML identifier.

## Custom metadata

Gating-ML 2.0 allows custom information to be added to the Gating-ML file. This can be done at the top level of the XML, and also for each of the gates in the Gating-ML file.

We use the top level custom information to provide the following custom information:

- The fact that this Gating-ML file is an export from FlowRepository
- The FlowRepository experiment number
- The FlowRepository experiment title
- The FlowRepository URL of the experiment
- The date and time stamp of the Gating-ML export

We use the gate level custom information to provide the following custom information:

- Gate name
- Gate id
- Gate type
- Related compensation id
- The x and y position of the label
- The x and y positions of labels of 4 “quads” of a Quadrant gate or 2 “Split ranges” of a Split gate.
- The y coordinate of 1 dimensional gates. (One dimensional gates in Gating-ML are using a single dimension only in the gate definition; we choose to also save “y” for drawing purposes; it specifies the position of the dotted horizontal line in a figure. This information is not required, but it can be reused to produce the same images if one-dimensional gates are exported from FlowRepository and imported into a different dataset).

Some of this custom information is reused when importing Gating-ML files that have been exported from FlowRepository. This information is not critical in terms of being able to properly reconstruct populations based on imported gates, but it can improve the user experience by capturing some optional metadata and user preferences (e.g., position of gate labels).

## Differences between Gating-ML 2.0 export from Cytobank and FlowRepository

Cytobank and FlowRepository are using similar approaches to Gating-ML 2.0 export. However, there are a few differences as summarized below:

- FlowRepository stores gates in bin space while Cytobank and Gating-ML use visualization space. Therefore, FlowRepository includes unbinning as part of the Gating-ML export.
- FlowRepository references gate dimensions by channel indexes, while Cytobank and Gating-ML use a short channel name. While it is unlikely, it can happen that a single gate in FlowRepository is exported as several gates in Gating-ML in case the gate is applicable to several FCS files and these FCS files differ in the channels referenced by the gate. Similarly for gate sets (i.e., Boolean AND gates). Also, in order to guarantee uniqueness, encoded channel names are incorporated into the gate and gate set identifier in FlowRepository’s export (but not in Cytobank’s export).
- There are technical differences on the low level of the export due to different gate storage mechanisms (FlowRepository stores gates in XML while Cytobank stores gates in the database).
- Unlike Cytobank, FlowRepository does not support skewable quadrants and therefore, it does not have to convert these to polygon gates.
- Custom metadata differ between FlowRepository and Cytobank Gating-ML exports.

# Gating-ML import

## Required channels

In order to be able to import gates and custom compensation descriptions, the dimensions (channels) that are used to describe these gates and compensations must be present in at least one of the FCS data files associated with the dataset that is importing the Gating-ML file. If this is not the case, the gate (or compensation) cannot be imported, since it is not applicable to the experiment.

## Custom compensation

The custom compensation in Gating-ML 2.0 may include non-square spectrum matrices used for spectral unmixing in cases where there are more detectors used by the instrument than dyes in the sample. Since FlowRepository does not support spectral unmixing, it requires the Gating-ML spectrum matrix to be a square matrix with a direct one-to-one correspondence between dyes and detectors. In addition, it requires that 1s are placed on the diagonal of the matrix (which is normally the case). If these conditions are not met, the spectrum matrix cannot be imported.
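These import conditions can be sketched as a simple check. The sketch assumes that the i-th fluorochrome is paired with the i-th detector (how the one-to-one correspondence is established is not specified above), and the function name and tolerance are hypothetical.

```python
def check_importable_spectrum(fluorochromes, detectors, values, tol=1e-9):
    """Return a list of reasons why the matrix cannot be imported
    (an empty list means that it can)."""
    n = len(fluorochromes)
    if len(detectors) != n or len(values) != n or any(len(r) != n for r in values):
        return ["matrix is not square"]
    problems = []
    for i, row in enumerate(values):
        if abs(row[i] - 1.0) > tol:
            problems.append("diagonal entry %d is %r, expected 1" % (i, row[i]))
    return problems
```

A 2 × 3 spectrum matrix (two dyes, three detectors) is rejected as non-square, and a square matrix with 0.9 on the diagonal is rejected because of the diagonal condition.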

Additional compensation options, such as the “use of uncompensated data” and “use of data compensated as prescribed by the data file” are directly translated to these options in FlowRepository.

## Scale transformations

Gating-ML scale transformations are mapped to FlowRepository scales as follows:

- Linear or no transformation in Gating-ML is mapped to Linear scale in FlowRepository.
- Log transformation in Gating-ML is mapped to Log scale in FlowRepository.
- ArcSinH, Logicle or Hyperlog transformation in Gating-ML is mapped to ArcSinH scale in FlowRepository. FlowRepository’s compression width c is set to T/sinh(1), where T is the parameter of the ArcSinH, Logicle or Hyperlog transformation in Gating-ML. This is the inverse of what we are doing when saving FlowRepository’s ArcSinH scale in Gating-ML; see the Scale transformations section under Gating-ML export above.

Every time that either the scale type, or the scale parameterization in Gating-ML differs from what FlowRepository can use internally, the import repositions and adjusts the gate accordingly, and notifies the user about the adjustment.

The following procedure is used for repositioning and adjusting the gate: Using the scale transformation in Gating-ML, gate “points” are converted to the raw data space. After that, FlowRepository’s scale is applied, and the gate is saved in the binned visualization space. Depending on the gate type, “points” mean vertices of a polygon or rectangular gate, center of a quad or split gate, edge coordinates of a range gate, and representative points of an ellipse. Representative points of an ellipse are created as the union of ellipse handle points with points where the ellipse intersects with its axis-aligned bounding box (i.e., points with maximum and minimum x and y coordinates). The resulting ellipse is then fit back to these points after the conversion to FlowRepository’s space.
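For a gate coordinate defined on a Gating-ML fasinh scale, the repositioning can be sketched as follows. The inverse is derived from the fasinh formula given in the export section; the final binning step is omitted because its details are not described here, and the function names are hypothetical.

```python
import math

LN10 = math.log(10)


def fasinh_inverse(y, M, T, A):
    """Raw value x for which Gating-ML's fasinh(x, M, T, A) equals y."""
    return T * math.sinh(y * (M + A) * LN10 - A * LN10) / math.sinh(M * LN10)


def reposition_point(y, M, T, A):
    """Move one gate coordinate from Gating-ML fasinh space to
    FlowRepository's Arcsinh scale (before binning)."""
    raw = fasinh_inverse(y, M, T, A)   # step 1: back to the raw data space
    c = T / math.sinh(1.0)             # compression width chosen on import
    return math.asinh(raw / c)         # step 2: apply FlowRepository's scale
```

With this choice of c, the top-of-scale value T (which fasinh maps to 1) is again mapped to asinh(sinh(1)) = 1, so `reposition_point(1.0, 4.5, 262144.0, 0.0)` returns approximately 1.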

The gate repositioning and adjusting procedure guarantees that these points are exactly in the right places in terms of the new space that the gate is described in. However, due to slight differences in the original scales or scale parameterizations, populations enclosed by these gates may slightly differ from those that would be obtained if the exact scale was used. (A difference in the scale transformation is in nature similar to “curving” the edges of a gate.) For future improvements, one could achieve better precision by either implementing all the Gating-ML scales and parameterizations in FlowRepository internally, or by adding extra vertices by dividing each edge of a gate into multiple segments. Rectangle and ellipse gates would have to be approximated by polygon gates in this case.

Gating-ML also supports gating on new dimensions created as the ratio of two different channels. This functionality is not supported in FlowRepository and therefore, these gates cannot be imported.

## Gates

### Rectangle gates

Gating-ML supports n-dimensional rectangle gates. This covers range gates (1-dimensional), regular rectangle gates (2-dimensional), boxes (3-dimensional) and hyper-rectangular regions (4 or more dimensions). Currently, 1-dimensional Gating-ML rectangle gates are imported as range gates in FlowRepository, and 2-dimensional Gating-ML rectangle gates as rectangle gates in FlowRepository. The import of rectangle gates with 3 or more dimensions is not supported since these don’t have a direct counterpart in FlowRepository. However, if it turns out that multidimensional rectangular gates are actually being used by third parties, we could consider breaking these down into several 1- or 2-dimensional rectangles, which would be referenced by a single “GateSet”. This would create an intersection of these rectangles and ultimately define the same population.
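The possible decomposition mentioned above (which is not currently implemented) could look roughly like the sketch below. It uses a hypothetical representation of a rectangle as a list of (channel, minimum, maximum) tuples and assumes that the minimum is inclusive and the maximum exclusive.

```python
def split_rectangle(dimensions):
    """Split an n-dimensional rectangle into 1- or 2-dimensional pieces."""
    return [dimensions[i:i + 2] for i in range(0, len(dimensions), 2)]


def in_rectangle(event, dimensions):
    """True if the event (a dict of channel -> value) lies within all bounds."""
    return all(lo <= event[channel] < hi for channel, lo, hi in dimensions)
```

An event is inside the original rectangle exactly when it is inside every piece, which is the intersection that a single GateSet referencing all pieces would express.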

### Polygon gates

Polygon gates in Gating-ML are imported as polygon gates in FlowRepository. The representations in Gating-ML and in FlowRepository are similar and the import is therefore just a matter of a straightforward conversion.

### Ellipsoid gates

Gating-ML supports ellipsoid gates in two or more dimensions. However, only two-dimensional ellipse gates can be imported to FlowRepository since FlowRepository does not support ellipsoids in more than 2 dimensions. Gating-ML uses the covariance matrix representation of ellipsoid gates, so these are converted to FlowRepository’s major/minor/angle representation as part of the Gating-ML import process.
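The conversion from a 2 × 2 covariance matrix to a major/minor/angle description can be sketched with the closed-form eigen-decomposition of a symmetric matrix. The sketch returns half-axis lengths and an angle in radians and scales the axes by the square root of the gate's squared-distance threshold; the exact conventions of FlowRepository's internal representation are not described here, so these are assumptions.

```python
import math


def covariance_to_ellipse(cov, distance_square=1.0):
    """Return (major half axis, minor half axis, angle of the major axis)."""
    (cxx, cxy), (_, cyy) = cov
    mean = (cxx + cyy) / 2.0
    diff = math.hypot((cxx - cyy) / 2.0, cxy)
    lam_major = mean + diff
    lam_minor = max(mean - diff, 0.0)  # guard against rounding below zero
    # Direction of the eigenvector that belongs to the larger eigenvalue.
    angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    scale = math.sqrt(distance_square)
    return scale * math.sqrt(lam_major), scale * math.sqrt(lam_minor), angle
```

For the covariance matrix `[[5, 4], [4, 5]]`, the eigenvalues are 9 and 1, so the half axes are 3 and 1 and the major axis is rotated by 45 degrees.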

### Quadrant gates

Gating-ML supports n-dimensional quadrant gates. One-dimensional quadrant gates are imported as Split gates in FlowRepository. Two-dimensional Quadrant gates are imported as regular quadrant gates in FlowRepository. Gating-ML allows each dimension to be split at several positions, and not all dimensions need to be referenced from a Quad definition; these non-standard quadrants are not supported by FlowRepository and cannot be imported at this time. If it turns out that they are actually being used by third-party software, FlowRepository could consider importing some of these as n-dimensional rectangular gates.

### Boolean gates

Gating-ML supports Boolean AND (with 2 or more arguments), OR (with 2 or more arguments), and NOT (with 1 argument) gates. In addition, each “argument gate” may be considered as the complement of the specified gate (i.e., A AND NOT B). FlowRepository does not support Boolean gates; however, Boolean AND gates are essentially how FlowRepository GateSets (i.e., Populations) are defined. Therefore, Boolean AND gates are imported as FlowRepository’s GateSets. If a gate reference is specified to be used as a complement, then this gate cannot be imported to FlowRepository. Gating-ML Boolean OR and NOT gates cannot be imported since these concepts cannot be represented by FlowRepository at this point.

Gates in Gating-ML may have a “parent_id” attribute stating that the gate is supposed to be applied to the population defined by the “parent” gate. If that occurs, the parent gate will be added to the GateSet definition in FlowRepository. Several parent gates may be added in case a gate specifies a parent that itself specifies a parent, and so on.
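Following the chain of parents can be sketched as below. The `parent_of` mapping from gate ID to parent ID is a hypothetical structure assumed to have been extracted from the Gating-ML file; the cycle check is a defensive addition of this sketch.

```python
def gate_set_with_parents(gate_id, parent_of):
    """Gate IDs that make up the GateSet for gate_id: the gate itself
    followed by all of its ancestors.

    parent_of maps a gate ID to its parent_id (missing or None if no parent).
    """
    chain = []
    current = gate_id
    while current is not None:
        if current in chain:
            raise ValueError("cyclic parent reference at %r" % current)
        chain.append(current)
        current = parent_of.get(current)
    return chain
```

For a gate `CD4` whose parent is `Lymph`, which in turn has the parent `Singlets`, the resulting GateSet references all three gates.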