# LAPIS - sars_cov-2_nextstrain_open

> LAPIS (Lightweight API for Sequences) instance for sars_cov-2_nextstrain_open.
> Query genomic sequence data with powerful mutation filters, metadata combinations, and Boolean logic.

This instance contains data for sars_cov-2_nextstrain_open.

The LAPIS code is open source and available at https://github.com/GenSpectrum/LAPIS.
LAPIS is a convenience API around SILO, a high-performance query engine for genomic sequences.
The SILO code is available at https://github.com/GenSpectrum/LAPIS-SILO.
If you need more detailed information than mentioned in this file, refer to the [LAPIS docs](https://lapis.cov-spectrum.org/open/v2/docs/).

## Instance Configuration

### Metadata Fields

The following metadata fields are available for filtering on this instance:


- **ace2Binding** (FLOAT)
- **age** (INT)
- **authors** (STRING)
- **country** (STRING)
- **countryExposure** (STRING)
- **database** (STRING)
- **date** (DATE)
- **dateDay** (INT)
- **dateMonth** (INT)
- **dateOriginalValue** (STRING)
- **dateSubmitted** (DATE)
- **dateSubmittedDay** (INT)
- **dateSubmittedMonth** (INT)
- **dateSubmittedOriginalValue** (STRING)
- **dateSubmittedYear** (INT)
- **dateUpdated** (DATE)
- **dateUpdatedDay** (INT)
- **dateUpdatedMonth** (INT)
- **dateUpdatedOriginalValue** (STRING)
- **dateUpdatedYear** (INT)
- **dateYear** (INT)
- **division** (STRING)
- **divisionExposure** (STRING)
- **genbankAccession** (STRING)
- **genbankAccessionRev** (STRING)
- **gisaidClade** (STRING)
- **gisaidEpiIsl** (STRING)
- **host** (STRING)
- **immuneEscape** (FLOAT)
- **location** (STRING)
- **nextcladeCoverage** (FLOAT)
- **nextcladePangoLineage** (STRING, lineage index generated)
- **nextcladeQcFrameShiftsScore** (FLOAT)
- **nextcladeQcMissingDataScore** (FLOAT)
- **nextcladeQcMixedSitesScore** (FLOAT)
- **nextcladeQcOverallScore** (FLOAT)
- **nextcladeQcPrivateMutationsScore** (FLOAT)
- **nextcladeQcSnpClustersScore** (FLOAT)
- **nextcladeQcStopCodonsScore** (FLOAT)
- **nextstrainClade** (STRING)
- **originatingLab** (STRING)
- **pangoLineage** (STRING, lineage index generated)
- **region** (STRING)
- **regionExposure** (STRING)
- **samplingStrategy** (STRING)
- **sex** (STRING)
- **sraAccession** (STRING)
- **strain** (STRING)
- **submittingLab** (STRING)
- **whoClade** (STRING)
- **died** (BOOLEAN)
- **fullyVaccinated** (BOOLEAN)
- **hospitalized** (BOOLEAN)
- **nextcladeDatasetVersion** (STRING)
- **usherTree** (STRING)


### How to Filter by Metadata Fields

You can use metadata fields as filter parameters in your queries. The filter syntax depends on the field type:

- **String fields**: Use exact or regex match.
  You can also supply an array; the values will be combined with logical OR.
  Examples: `"authors": "someValue"`, `"authors.regex": "^startsWithThis.*"`, `"authors": ["someValue", "orOtherValue"]`
- **Lineage fields**: For string fields that also have a lineage index, you can filter for exact matches (`"nextcladePangoLineage": "lineage"`) or including sublineages (`"nextcladePangoLineage": "lineage*"`).
- **Date fields**: Use `From` and `To` suffixes for ranges. Example: `"dateFrom": "2023-01-01", "dateTo": "2023-12-31"`
- **Integer fields**: Use exact match or `From`/`To` for ranges. Example: `"age": 42` or `"ageFrom": 10, "ageTo": 50`
- **Float fields**: Use exact match or `From`/`To` for ranges. Example: `"ace2Binding": 0.95` or `"ace2BindingFrom": 0.8, "ace2BindingTo": 1.0`
- **Boolean fields**: Use `true` or `false`. Example: `"died": true`

You can combine multiple filters in a single query. All filters are combined with AND logic.
All exact filters also support filtering for `null`:
- `"authors.isNull": true` filters for `null` values.
- `"authors.isNull": false` filters for non-`null` values.
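The filter rules above can be combined into one request body. The following is a minimal sketch in Python that only builds and prints the JSON payload. The concrete values (`Switzerland`, `Germany`, `BA.2`) are illustrative assumptions and not taken from this instance's data; the field names come from the list of metadata fields above.

```python
import json


def build_metadata_filter():
    """Assemble a filter object that uses each filter style described above."""
    return {
        # String field with an array: the values are combined with logical OR.
        "country": ["Switzerland", "Germany"],
        # Lineage field: the trailing * also includes sublineages.
        "nextcladePangoLineage": "BA.2*",
        # Date field: range via the From/To suffixes.
        "dateFrom": "2023-01-01",
        "dateTo": "2023-12-31",
        # Integer field: range via the From/To suffixes.
        "ageFrom": 10,
        "ageTo": 50,
        # Boolean field.
        "hospitalized": True,
        # Keep only records whose authors field is not null.
        "authors.isNull": False,
    }


payload = build_metadata_filter()
# json.dumps turns Python True/False into JSON true/false.
print(json.dumps(payload, indent=2))
```

Because all filters are combined with AND, every key added to the object narrows the result further.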

### Genes and Segments

This instance uses a single-segmented genome.

Available genes for amino acid queries: E, M, N, ORF1a, ORF1b, ORF3a, ORF6, ORF7a, ORF7b, ORF8, ORF9b, S

### How to Filter by Mutations

You can filter sequences by nucleotide and amino acid mutations.

**Nucleotide mutations:**
- **Point mutations**: Specify position and substitution. Example: `"nucleotideMutations": ["main:123T"]` (position 123 changed to thymine)
- **Deletions**: Use `-` for deleted bases. Example: `"nucleotideMutations": ["main:123-"]`
- **Insertions**: Use `nucleotideInsertions` filter. Example: `"nucleotideInsertions": ["ins_main:123:AAA"]` (3 adenines inserted after position 123)
- **Maybe mutations**: Wrap a mutation in `MAYBE(...)` to also match sequences that have an ambiguity symbol at this position that could include the mutation.
  Example: `"nucleotideMutations": ["MAYBE(main:123T)"]`

**Amino acid mutations:**
- **Point mutations**: Specify gene, position, and substitution. Example: `"aminoAcidMutations": ["S:484K"]` (gene S, position 484, lysine)
- **Deletions**: Use `-` for deleted residues. Example: `"aminoAcidMutations": ["S:69-"]`
- **Insertions**: Use `aminoAcidInsertions` filter. Example: `"aminoAcidInsertions": ["ins_S:214:EPE"]` (EPE inserted after position 214 in S)
- **Maybe mutations**: Same syntax as for nucleotide mutations. Example: `"aminoAcidMutations": ["MAYBE(S:484K)"]`

**Boolean logic:** Multiple mutations in arrays are combined with AND.
Example: `"nucleotideMutations": ["main:123T", "main:456A"]` (both mutations required)
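The mutation syntax above is plain string formatting, so it may help to generate it rather than type it by hand. This is a sketch only: the helper names `nucleotide_mutation`, `amino_acid_mutation`, and `insertion` are hypothetical and not part of LAPIS, and the positions are placeholders.

```python
def nucleotide_mutation(position, base, segment="main", maybe=False):
    """Format a nucleotide mutation such as main:123T; pass "-" as base for a deletion."""
    mutation = f"{segment}:{position}{base}"
    return f"MAYBE({mutation})" if maybe else mutation


def amino_acid_mutation(gene, position, residue, maybe=False):
    """Format an amino acid mutation such as S:484K; pass "-" as residue for a deletion."""
    mutation = f"{gene}:{position}{residue}"
    return f"MAYBE({mutation})" if maybe else mutation


def insertion(sequence_name, position, inserted):
    """Format an insertion such as ins_main:123:AAA (works for segments and genes)."""
    return f"ins_{sequence_name}:{position}:{inserted}"


# All entries of one array are combined with AND.
mutation_filter = {
    "nucleotideMutations": [
        nucleotide_mutation(123, "T"),
        nucleotide_mutation(456, "-"),
    ],
    "aminoAcidMutations": [amino_acid_mutation("S", 484, "K", maybe=True)],
    "nucleotideInsertions": [insertion("main", 123, "AAA")],
}
print(mutation_filter)
```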

## API Endpoints

The OpenAPI spec is available at [api-docs](api-docs).
Refer to that if you need more details on an endpoint.

### Data Retrieval and Mutation Analysis

These are the primary entry points for analyzing the data in this LAPIS instance.

These endpoints are available as GET or POST.
Prefer POST since it allows more flexible requests.
Use GET when you want to have links that are easy to share since all their parameters can be passed as query parameters.

Every endpoint accepts filters on metadata fields and mutations.
Use these to narrow down the sequences that are included in the results.

Every endpoint also accepts:
- `limit`: Maximum number of result entries to return.
  This is especially useful for endpoints that return large responses otherwise.
- `offset`: Number of result entries to skip before starting to return results.
  Use this for pagination in combination with `limit`.
- `orderBy`: A list of fields to sort the results by. Each field must be contained in the response data.
- `dataFormat`: By default, sequence endpoints return FASTA, the other endpoints return JSON.
  You can change this with the `dataFormat` parameter. Consult the OpenAPI spec for which formats are supported for which endpoints.
- `compression`: Can be used to enable response compression (`gzip` or `zstd`).
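Pagination with `limit` and `offset` can be sketched as a loop that requests one page at a time and stops when a page comes back short. The example below is an illustration, not the official client: `paginate`, `fake_fetch`, and `fake_records` are hypothetical names, and the HTTP call is replaced by an in-memory stand-in so the logic can be run offline.

```python
def paginate(fetch_page, filters, page_size=100, max_pages=10):
    """Yield result entries page by page using limit and offset."""
    for page in range(max_pages):
        payload = {**filters, "limit": page_size, "offset": page * page_size}
        entries = fetch_page(payload)
        yield from entries
        if len(entries) < page_size:
            break  # a short page means there is nothing left to fetch


# Stand-in for a real HTTP call: serves slices of an in-memory list.
fake_records = [{"strain": f"sample-{i}"} for i in range(250)]


def fake_fetch(payload):
    start = payload["offset"]
    return fake_records[start:start + payload["limit"]]


all_entries = list(paginate(fake_fetch, {"country": "Switzerland"}))
print(len(all_entries))
```

When paging against a live instance, it is probably wise to also pass `orderBy` so that the order of entries stays the same from one page to the next.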

The response is usually a JSON object that contains:
- `data`: an array of result entries, if the query was successful.
- `error`: an error message, if the query failed. The message should usually give you an idea of what went wrong with your request.
- `info`: additional information about the LAPIS instance.
  The `dataVersion` field can be used to check if the underlying data has been updated since your last query.
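A small wrapper can turn this response envelope into either a list of entries or an exception. The sketch below uses only the Python standard library. The `BASE_URL` value is an assumption derived from the docs link at the top of this file, and it is also an assumption that failed requests return the same JSON envelope in the body of the HTTP error response; `unwrap` and `post_query` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request

# Assumption: the API lives one level above the docs URL given in this file.
BASE_URL = "https://lapis.cov-spectrum.org/open/v2"


def unwrap(body):
    """Return (data, data_version) from a parsed response, or raise on error."""
    if body.get("error"):
        raise RuntimeError(f"LAPIS query failed: {body['error']}")
    return body.get("data", []), body.get("info", {}).get("dataVersion")


def post_query(endpoint, payload):
    """POST a JSON payload to an endpoint and unwrap the response."""
    request = urllib.request.Request(
        f"{BASE_URL}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            body = json.load(response)
    except urllib.error.HTTPError as http_error:
        # Assumption: the error response body carries the same JSON envelope.
        body = json.load(http_error)
    return unwrap(body)
```

A call such as `post_query("sample/aggregated", {"fields": ["country"]})` would then return the result entries together with the `dataVersion`, which can be compared between calls to detect a data update.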

Endpoints:
- [sample/aggregated](sample/aggregated): Count and group sequences by metadata and mutations.
  This is similar to a "select count(*) from ... where <filters> group by <fields>" SQL query.
- [sample/details](sample/details):
  Returns the actual metadata values for sequences that match your filters. Use this to get individual sequence records.
  Similar to a "select <fields ?? *> from ... where <filters>" SQL query.
- [sample/alignedNucleotideSequences](sample/alignedNucleotideSequences):
  Returns nucleotide sequences aligned to the reference genome in FASTA format.
  Typically used to download sequences for offline analysis. Not recommended for large result sets.
- [sample/unalignedNucleotideSequences](sample/unalignedNucleotideSequences):
  Returns raw nucleotide sequences without alignment.
  Typically used to download sequences for offline analysis. Not recommended for large result sets.
- [sample/alignedAminoAcidSequences](sample/alignedAminoAcidSequences):
  Returns translated protein sequences for multiple genes at once.
  Typically used to download sequences for offline analysis. Not recommended for large result sets.
- [sample/alignedAminoAcidSequences/{gene}](sample/alignedAminoAcidSequences/{gene}):
  Returns translated protein sequences for a single gene. Specify the gene name in the URL path.
  Typically used to download sequences for offline analysis. Not recommended for large result sets.
- [sample/nucleotideMutations](sample/nucleotideMutations): List nucleotide mutations with their proportions.
  Shows which nucleotide mutations appear in your filtered sequences and how frequently.
  Example: "C123T appears in 45% of sequences that match <filters>".
- [sample/aminoAcidMutations](sample/aminoAcidMutations): List amino acid mutations with their proportions.
  Shows which amino acid mutations appear in your filtered sequences and how frequently.
  Example: "S:484K appears in 30% of sequences that match <filters>".
- [sample/nucleotideInsertions](sample/nucleotideInsertions): List nucleotide insertions.
  Shows which nucleotide insertions occur in the sequences that match your filters, and how often.
- [sample/aminoAcidInsertions](sample/aminoAcidInsertions): List amino acid insertions.
  Shows which amino acid insertions occur in the sequences that match your filters, and how often.

### Time Series

These endpoints are mainly built for specialized display components that show time series data in a tabular form.
Useful for tracking trends over time.
These endpoints only accept POST.

- [sample/queriesOverTime](sample/queriesOverTime): Query results aggregated over time.
  Shows how many sequences match your filters for each time period (e.g., daily, weekly).
- [sample/nucleotideMutationsOverTime](sample/nucleotideMutationsOverTime): Query nucleotide mutations aggregated over time.
  Shows how mutation frequencies change over time. Useful for tracking the emergence and spread of specific mutations.
- [sample/aminoAcidMutationsOverTime](sample/aminoAcidMutationsOverTime): Query amino acid mutations aggregated over time.
  Shows how amino acid mutation frequencies change over time.


### Phylogenetic Analysis

- [sample/mostRecentCommonAncestor](sample/mostRecentCommonAncestor): Find most recent common ancestor for queried sequences.
  Identifies the MRCA node in the phylogenetic tree that contains all sequences matching your filters. Useful for understanding evolutionary relationships.
- [sample/phyloSubtree](sample/phyloSubtree): Get phylogenetic subtree in Newick format.
  Returns a subtree containing only the sequences matching your filters. The subtree is in Newick format and can be visualized in phylogenetic tree viewers.


### Info

- [info/info](info/info): Get instance information and versions.
  Useful for debugging or confirming you're connected to the right instance.
- [info/databaseConfig](info/databaseConfig): Retrieve the complete database configuration.
  Contains the complete metadata schema and configuration. Use this to discover what fields are available for filtering.
- [info/referenceGenome](info/referenceGenome): Retrieve the complete reference genome.
  Returns the full reference genome sequences. Warning: Large response.
  Only use when you need the actual reference sequences.
- [info/lineageDefinition/{column}](info/lineageDefinition/{column}):
  Retrieve the lineage definition file for a specific metadata column.
  Returns lineage hierarchy and parent-child relationships. Useful for understanding lineage classifications.
  Warning: Usually quite large.

## Query Examples

These examples demonstrate common query patterns using POST requests with JSON payloads.
The endpoints used in these examples also support GET requests with query parameters.

### Count sequences by a metadata field

```
POST sample/aggregated
Content-Type: application/json

{
  "fields": ["authors"]
}
```

Returns the number of sequences for each distinct `authors` value.

### Filter by date range

```
POST sample/aggregated
Content-Type: application/json

{
  "dateFrom": "2023-01-01",
  "dateTo": "2023-12-31",
  "fields": ["date"]
}
```

Returns the number of sequences within the specified date range, grouped by date.

### Find sequences with specific mutation

```
POST sample/details
Content-Type: application/json

{
  "nucleotideMutations": ["main:123T"],
  "limit": 10
}
```

Returns the metadata of up to 10 sequences that carry the main:123T nucleotide mutation.

### Complex mutation filter with Boolean logic

```
POST sample/aggregated
Content-Type: application/json

{
  "nucleotideMutations": ["main:123T"],
  "aminoAcidMutations": ["S:484K", "S:501Y"],
  "fields": ["authors"]
}
```

Returns the number of sequences that carry the main:123T nucleotide mutation AND both the S:484K AND S:501Y amino acid mutations, grouped by authors.

### Get sequences for a specific gene

```
POST sample/alignedAminoAcidSequences
Content-Type: application/json

{
  "genes": ["E"],
  "authors": "someValue",
  "limit": 5
}
```

Returns up to 5 aligned amino acid sequences for the E gene.

The equivalent GET request would be:
```
GET sample/alignedAminoAcidSequences?genes=E&limit=5&authors=someValue
```
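The translation from a JSON payload to a GET URL is mechanical for flat filters, so it can be sketched with the standard library. `to_get_url` is a hypothetical helper. It assumes that list values are passed as repeated query parameters (as in the `orderBy` GET example in this file) and that booleans are written in lowercase.

```python
from urllib.parse import urlencode


def to_get_url(endpoint, payload):
    """Encode a flat filter payload as query parameters; lists become repeated keys."""
    params = {}
    for key, value in payload.items():
        values = value if isinstance(value, list) else [value]
        # JSON booleans are lowercase; str(True) would give "True".
        params[key] = [str(v).lower() if isinstance(v, bool) else v for v in values]
    # doseq=True emits one key=value pair per list element.
    return f"{endpoint}?{urlencode(params, doseq=True)}"


url = to_get_url(
    "sample/alignedAminoAcidSequences",
    {"genes": ["E"], "limit": 5, "authors": "someValue"},
)
print(url)
```

`urlencode` also percent-encodes characters that are not safe in a URL, which matters for values such as author lists containing spaces or commas.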

### Analyze mutation proportions

```
POST sample/nucleotideMutations
Content-Type: application/json

{
  "authors": "someValue",
  "minProportion": 0.05
}
```

Returns all nucleotide mutations appearing in at least 5% of sequences matching the filter.

### Order results by multiple fields

```
POST sample/aggregated
Content-Type: application/json

{
  "fields": ["date", "authors"],
  "orderBy": [
    {"field": "count", "type": "ascending"},
    {"field": "date", "type": "ascending"},
    {"field": "authors", "type": "descending"}
  ]
}
```

Returns aggregated results grouped by date and authors, sorted first by ascending `count`, then by date (oldest first), then by authors (Z to A).
The `orderBy` parameter accepts an array of objects, each with a `field` (must be in the response) and `type` (`ascending` or `descending`).

GET requests only accept a plain array of field names for `orderBy`, and they will be sorted in ascending order:
```
GET sample/aggregated?fields=date&orderBy=date&orderBy=count
```
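Since GET cannot express a sort direction, converting a POST-style `orderBy` to its GET form only works when every entry is ascending. The sketch below makes that restriction explicit. `order_by_for_get` is a hypothetical helper, and treating a missing `type` as `ascending` is an assumption.

```python
from urllib.parse import urlencode


def order_by_for_get(order_by):
    """Reduce POST-style orderBy objects to the plain field names GET accepts."""
    fields = []
    for entry in order_by:
        # Assumption: an entry without "type" is treated as ascending.
        if entry.get("type", "ascending") != "ascending":
            raise ValueError(
                f"GET cannot express descending order for {entry['field']!r}; use POST"
            )
        fields.append(entry["field"])
    return fields


post_order = [
    {"field": "date", "type": "ascending"},
    {"field": "count", "type": "ascending"},
]
query = urlencode(
    {"fields": ["date"], "orderBy": order_by_for_get(post_order)}, doseq=True
)
print(f"sample/aggregated?{query}")
```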