
Preprocessing

SILO preprocessing accepts input data in one of two formats:

  • NDJSON: a single NDJSON file containing all the data,
  • TSV/FASTA: a directory containing
    • a TSV file with the metadata
    • FASTA files with the sequences

The preprocessing configuration file determines which format should be used.

Preprocessing Configuration

The preprocessing configuration file is a YAML file that allows the keys shown in the table below. All keys are optional and have default values. Some keys are relevant only for one of the two input file formats.

Key                               | Input Format | Default                        | Default in Docker Image
----------------------------------|--------------|--------------------------------|------------------------
inputDirectory                    | both         | ./ (current working directory) | /preprocessing/input/
outputDirectory                   | both         | ./output/                      | /preprocessing/output/
intermediateResultsDirectory      | both         | ./temp/                        | /preprocessing/temp/
preprocessingDatabaseLocation     | both         | (absent)                       |
ndjsonInputFilename               | NDJSON       | (absent)                       |
metadataFilename                  | TSV/FASTA    | metadata.tsv                   |
pangoLineageDefinitionFilename    | both         | (absent)                       |
referenceGenomeFilename           | both         | reference_genomes.json         |
nucleotideSequencePrefix          | TSV/FASTA    | nuc_                           |
genePrefix                        | TSV/FASTA    | gene_                          |
unalignedNucleotideSequencePrefix | TSV/FASTA    | unaligned_                     |

Description of Keys for Both Formats

  • inputDirectory: The directory where input files are located.
  • outputDirectory: The directory where output files will be placed.
  • intermediateResultsDirectory: The directory for storing intermediate results not relevant to the end user, mainly for debugging.
  • preprocessingDatabaseLocation: The file for storing internal, intermediate database states for debugging.
  • pangoLineageDefinitionFilename: The file with Pango lineage definitions, relative to the inputDirectory. See the section on the Pango Lineage Definition File below for details.
  • referenceGenomeFilename: The file with reference genomes, relative to the inputDirectory.
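
As a rough illustration of how the documented defaults combine with user-provided keys, the following Python sketch overlays a partial configuration on the defaults from the table and resolves a filename relative to inputDirectory. The names `DEFAULTS`, `resolve_config`, and `input_path` are hypothetical helpers for this example and are not part of SILO; keys documented as "(absent)" are modelled as `None`.

```python
from pathlib import PurePosixPath

# Defaults as documented in the table above (non-Docker column).
# Keys documented as "(absent)" are modelled as None in this sketch.
DEFAULTS = {
    "inputDirectory": "./",
    "outputDirectory": "./output/",
    "intermediateResultsDirectory": "./temp/",
    "preprocessingDatabaseLocation": None,
    "ndjsonInputFilename": None,
    "metadataFilename": "metadata.tsv",
    "pangoLineageDefinitionFilename": None,
    "referenceGenomeFilename": "reference_genomes.json",
    "nucleotideSequencePrefix": "nuc_",
    "genePrefix": "gene_",
    "unalignedNucleotideSequencePrefix": "unaligned_",
}


def resolve_config(user_config):
    """Overlay user-provided keys on the documented defaults."""
    return {**DEFAULTS, **user_config}


def input_path(config, filename_key):
    """Resolve a filename key relative to inputDirectory (None if absent)."""
    filename = config[filename_key]
    if filename is None:
        return None
    return str(PurePosixPath(config["inputDirectory"]) / filename)


# "/data/in/" is a made-up directory used only for illustration.
config = resolve_config({"inputDirectory": "/data/in/"})
print(input_path(config, "referenceGenomeFilename"))
print(input_path(config, "pangoLineageDefinitionFilename"))
```

Only referenceGenomeFilename and pangoLineageDefinitionFilename are resolved this way here, because those are the two keys the descriptions above state to be relative to inputDirectory.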

NDJSON Format

SILO will initiate preprocessing in the NDJSON format if ndjsonInputFilename is specified in the preprocessing configuration.

Each line in the NDJSON file must be a JSON object with the following keys:

Key                          | Type   | Description
-----------------------------|--------|--------------------------------------------------------
metadata                     | object | An object containing all metadata as key-value pairs.
unalignedNucleotideSequences | object | A sequences object with unaligned nucleotide sequences.
alignedNucleotideSequences   | object | A sequences object with aligned nucleotide sequences.
alignedAminoAcidSequences    | object | A sequences object with aligned amino acid sequences.
aminoAcidInsertions          | object | An insertions object with amino acid insertions.
nucleotideInsertions         | object | An insertions object with nucleotide insertions.

Sequences Object

The sequences object contains sequences for each segment or gene. It must include all nucleotideSequences (or genes, respectively) specified in the reference genomes as keys. Its values are the sequences as strings of valid symbols or null.

Insertions Object

The insertions object contains a list of insertions for each segment or gene. It must include all nucleotideSequences (or genes, respectively) specified in the reference genomes as keys. Its values are arrays of strings in the format <position>:<insertion>. The insertions must consist of valid symbols.
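
To make the `<position>:<insertion>` format concrete, here is a minimal Python sketch that splits each entry of an insertions object into a numeric position and the inserted symbols. The function names are hypothetical, and the sketch does not check whether the symbols are valid, since the set of valid symbols is not spelled out in this section.

```python
def parse_insertion(entry):
    """Split '<position>:<insertion>' into (position, inserted_symbols)."""
    position_text, separator, symbols = entry.partition(":")
    if not separator or not position_text.isdecimal() or not symbols:
        raise ValueError(f"not in <position>:<insertion> format: {entry!r}")
    return int(position_text), symbols


def parse_insertions_object(insertions):
    """Parse every entry of an insertions object, keyed by segment or gene."""
    return {
        name: [parse_insertion(entry) for entry in entries]
        for name, entries in insertions.items()
    }


example = {"segment1": ["3:G", "4:A"], "segment2": ["2:GTT"]}
print(parse_insertions_object(example))
```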

Example of the Schema

{
  "metadata": {
    "primaryKey": "sequence001",
    "pango_lineage": "B.1.1.7",
    "region": null,
    "age": 46,
    "qc_value": 0.98
  },
  "unalignedNucleotideSequences": {
    "segment1": "CGATA",
    "segment2": "ACG"
  },
  "alignedNucleotideSequences": {
    "segment1": "CGATAAT",
    "segment2": "ACGT"
  },
  "alignedAminoAcidSequences": {
    "gene1": "MYSLV*",
    "gene2": "MADVQ*",
    "gene3": "MSLYVQ*"
  },
  "nucleotideInsertions": {
    "segment1": ["3:G", "4:A"],
    "segment2": ["2:GTT"]
  },
  "aminoAcidInsertions": {
    "gene1": ["3:EPE", "4:Q"],
    "gene2": [],
    "gene3": []
  }
}
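
The schema example is spread over several lines for readability, whereas in the NDJSON file each record occupies a single line. The following Python sketch shows one way to produce such a line with the standard `json` module while checking that the six documented top-level keys are present. `REQUIRED_KEYS` and `to_ndjson_line` are hypothetical names for this example, not part of SILO.

```python
import json

# Top-level keys documented for each NDJSON record.
REQUIRED_KEYS = {
    "metadata",
    "unalignedNucleotideSequences",
    "alignedNucleotideSequences",
    "alignedAminoAcidSequences",
    "nucleotideInsertions",
    "aminoAcidInsertions",
}


def to_ndjson_line(record):
    """Serialize one record as a single line, checking the top-level keys."""
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # json.dumps without an indent argument emits no raw newlines.
    return json.dumps(record)


record = {
    "metadata": {"primaryKey": "sequence001", "region": None, "age": 46},
    "unalignedNucleotideSequences": {"segment1": "CGATA", "segment2": "ACG"},
    "alignedNucleotideSequences": {"segment1": "CGATAAT", "segment2": "ACGT"},
    "alignedAminoAcidSequences": {"gene1": "MYSLV*"},
    "nucleotideInsertions": {"segment1": ["3:G"], "segment2": []},
    "aminoAcidInsertions": {"gene1": []},
}
print(to_ndjson_line(record))
```

Writing one such line per record, each followed by a newline, yields a file in the shape described above.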

TSV/FASTA Format

SILO will initiate preprocessing in the TSV/FASTA format if metadataFilename is specified in the preprocessing configuration.

SILO expects the following files in the inputDirectory:

  • a TSV file with the metadata named as configured in metadataFilename,
  • FASTA files with the sequences.

Metadata File

The metadata file must be a TSV (tab-separated values) file. Its columns must correspond to the metadata fields specified in the database configuration. Empty values will be interpreted as null.

Example

Given the following database configuration:

schema:
  metadata:
    - name: primaryKey
      type: string
    - name: pango_lineage
      type: pango_lineage
    - name: region
      type: string
    - name: age
      type: int
    - name: qc_value
      type: float
    - name: insertions
      type: insertion
    - name: aaInsertions
      type: aaInsertion
# other configuration keys ...

The metadata file might look as follows:

primaryKey	pango_lineage	region	age	qc_value	insertions	aaInsertions
sequence001	B.1.1.7	Europe	46	0.98	segment1:123:AAA,segment2:456:GTT	gene1:123:EPE
sequence002	B.1.1.7		46	0.98		gene2:123:EPE,gene2:125:EPE
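
The rule that empty values are interpreted as null can be sketched in Python with the standard `csv` module. This is only an illustration of the file layout, not SILO's own parser; `read_metadata` is a hypothetical helper, and values are left as strings rather than converted to the types from the database configuration.

```python
import csv
import io


def read_metadata(tsv_text):
    """Read TSV metadata into a list of dicts, mapping empty values to None."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        {column: (value if value else None) for column, value in row.items()}
        for row in reader
    ]


tsv_text = (
    "primaryKey\tpango_lineage\tregion\tage\tqc_value\tinsertions\taaInsertions\n"
    "sequence001\tB.1.1.7\tEurope\t46\t0.98\tsegment1:123:AAA,segment2:456:GTT\tgene1:123:EPE\n"
    "sequence002\tB.1.1.7\t\t46\t0.98\t\tgene2:123:EPE,gene2:125:EPE\n"
)
for row in read_metadata(tsv_text):
    print(row["primaryKey"], row["region"], row["insertions"])
```

For sequence002, both region and insertions come back as `None`, matching the two empty cells in the example file.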

Sequence Files

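In the TSV/FASTA format, sequences must be stored in separate FASTA files. The filenames must follow this pattern: the configured prefix (nucleotideSequencePrefix for segments, genePrefix for genes), followed by the segment or gene name and the extension .fasta, as in nuc_segment1.fasta and gene_gene1.fasta in the example below.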

There must be one corresponding file for every segment and gene defined in the reference genomes.

The header in the FASTA files must match the primaryKey column in the metadata file. There must be a one-to-one correspondence between entries in the metadata file and sequences in the FASTA files.

Example

Given the reference genomes:

{
  "segments": [
    { "name": "segment1", "sequence": "/*...*/" },
    { "name": "segment2", "sequence": "/*...*/" }
  ],
  "genes": [
    { "name": "gene1", "sequence": "/*...*/" },
    { "name": "gene2", "sequence": "/*...*/" },
    { "name": "gene3", "sequence": "/*...*/" }
  ]
}

the input directory should contain the following files:

input/
├── gene_gene1.fasta
├── gene_gene2.fasta
├── gene_gene3.fasta
├── nuc_segment1.fasta
├── nuc_segment2.fasta
└── /* other files... */
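
The listing above can be derived from the reference genomes and the configured prefixes. The Python sketch below assumes the filename is simply prefix, name, and `.fasta`, which is inferred from this example rather than from a formal specification, and it assumes the reference genomes have the `segments` and `genes` structure shown above. `expected_fasta_files` is a hypothetical helper; unaligned sequence files are not covered.

```python
def expected_fasta_files(reference_genomes, nucleotide_prefix="nuc_", gene_prefix="gene_"):
    """List the FASTA filenames expected for every segment and gene."""
    segment_files = [
        f"{nucleotide_prefix}{segment['name']}.fasta"
        for segment in reference_genomes["segments"]
    ]
    gene_files = [
        f"{gene_prefix}{gene['name']}.fasta"
        for gene in reference_genomes["genes"]
    ]
    return segment_files + gene_files


reference_genomes = {
    "segments": [{"name": "segment1"}, {"name": "segment2"}],
    "genes": [{"name": "gene1"}, {"name": "gene2"}, {"name": "gene3"}],
}
for filename in expected_fasta_files(reference_genomes):
    print(filename)
```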

The file nuc_segment1.fasta might look as follows, assuming that the metadata file also has two entries with the primary keys sequence001 and sequence002:

>sequence001
CGATAAT
>sequence002
CGATAAT
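
The one-to-one correspondence between metadata entries and FASTA headers can be checked before running the preprocessing. The following Python sketch is one possible pre-flight check, not something SILO provides; `fasta_headers` and `check_one_to_one` are hypothetical names, and the header is taken to be everything after `>` on a header line.

```python
from collections import Counter


def fasta_headers(fasta_text):
    """Return the header of every FASTA record (text after '>')."""
    return [
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]


def check_one_to_one(primary_keys, headers):
    """Return (keys without sequence, headers without metadata, duplicated headers)."""
    key_set, header_set = set(primary_keys), set(headers)
    duplicated = sorted(h for h, count in Counter(headers).items() if count > 1)
    return sorted(key_set - header_set), sorted(header_set - key_set), duplicated


fasta_text = ">sequence001\nCGATAAT\n>sequence002\nCGATAAT\n"
headers = fasta_headers(fasta_text)
print(check_one_to_one(["sequence001", "sequence002"], headers))
```

Three empty lists mean that every primary key has exactly one sequence and no FASTA record lacks a metadata entry.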

The Pango Lineage Definition File

This file is relevant only if your data includes Pango Lineages.

The Pango lineage definition file is a JSON file that maps Pango lineage aliases to the lineages they stand for. SILO uses it to reconstruct the lineage tree, which it needs in order to group sequences into partitions effectively.

The file contains a JSON object with alias names as keys. The value for each key is:

  • an empty string if the alias is a root node,
  • the name of the parent node if the alias is a child node,
  • an array of parent nodes if the alias is a recombinant.

Here is a minimal example:

{
  "A": "",
  "B": "A.1.1.1",
  "XA": ["B.1.2", "B.1.42"]
}
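
To show how such a file lets a lineage tree be reconstructed, the Python sketch below expands the leading alias of a lineage name into its full ancestry path. It follows the usual Pango aliasing convention and is an assumption about how the mapping can be used, not a description of SILO's internal algorithm; `unalias` is a hypothetical helper. Root nodes and recombinants keep their own name as the prefix.

```python
def unalias(lineage, alias_key):
    """Expand the leading alias of a lineage name to its full ancestry path."""
    head, _, rest = lineage.partition(".")
    parent = alias_key.get(head)
    # Root nodes (""), recombinants (lists), and unknown names stay unchanged.
    if not parent or isinstance(parent, list):
        return lineage
    # The parent may itself start with an alias, so expand it recursively.
    expanded_head = unalias(parent, alias_key)
    return f"{expanded_head}.{rest}" if rest else expanded_head


alias_key = {"A": "", "B": "A.1.1.1", "XA": ["B.1.2", "B.1.42"]}
print(unalias("B.1.1.7", alias_key))  # A.1.1.1.1.1.7
print(unalias("XA.1", alias_key))     # XA.1
```

With the full path in hand, the parent of a lineage is obtained by dropping the last dot-separated component, which is what makes the tree structure recoverable from names alone.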

A complete example can be found here: https://github.com/cov-lineages/pango-designation/blob/master/pango_designation/alias_key.json