Introduction

LAPIS (Lightweight API for Sequences) is an open web application programming interface (API) allowing easy querying of SARS-CoV-2 sequencing data using web links. The core features are:

Filter sequences by metadata or mutations
Aggregate data by any metadata field you like
Get the full metadata
Get the sequences as FASTA (aligned or unaligned)
Responses can be formatted as JSON and as CSV

An OpenAPI specification is available here.

This instance uses fully public data from NCBI GenBank, pre-processed and hosted by Nextstrain. We update the data every day. More information about the underlying software and the code can be found in our GitHub repository.

In the following, we demonstrate the core features enabled by the API. On the left, we present the basic syntax of the API and on the right, we show how to use it for queries. In the section "Use Cases", we provide examples of how to use the API to query public SARS-CoV-2 sequencing data to generate statistics, create plots, or download sequences for further analysis.

Overview

The API has six main endpoints related to samples. These endpoints provide different types of data:

/sample/aggregated - use to get summary data aggregated across samples
/sample/details - use to get per-sample metadata
/sample/aa-mutations - use to get the common amino acid mutations (shared by at least 5% of the sequences)
/sample/nuc-mutations - use to get the common nucleotide mutations (shared by at least 5% of the sequences)
/sample/fasta - use to get original (unaligned) sequences
/sample/fasta-aligned - use to get aligned sequences

The API returns a response (data) based on a query to one of the endpoints. You can view a response in your browser, or use the data programmatically. We'll provide some examples in R.

The /sample/fasta and /sample/fasta-aligned endpoints have a limit of 100,000 sequences per request. We do not support pagination (yet). However, you could, for example, use the dateFrom and dateTo filters to load the sequences chunk by chunk.
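
One way to load sequences chunk by chunk is to split a date range into consecutive windows and send one request per window. The sketch below only builds the request URLs. The helper name date_chunks and the 30-day window are illustrative choices, not part of the API, and the sketch assumes that dateFrom and dateTo are both inclusive. Each window must still contain fewer than 100,000 sequences.

```r
# Split [from, to] into consecutive, non-overlapping windows of at most
# `days` days each. `date_chunks` is a hypothetical helper, not part of LAPIS.
date_chunks <- function(from, to, days = 30) {
  from <- as.Date(from)
  to <- as.Date(to)
  starts <- seq(from, to, by = days)
  # The last window is cut off at `to`.
  ends <- pmin(starts + days - 1, to)
  data.frame(dateFrom = format(starts), dateTo = format(ends))
}

chunks <- date_chunks("2021-01-01", "2021-03-31")

# One URL per window; each could be passed to download.file().
urls <- paste0(
  "https://lapis.cov-spectrum.org/open/v1/sample/fasta",
  "?country=Switzerland",
  "&dateFrom=", chunks$dateFrom,
  "&dateTo=", chunks$dateTo
)
print(urls)
```

If a window turns out to be too large, a smaller value for days narrows every window.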

Query Format

Query example:

Get the total number of available sequences:
/sample/aggregated

To query an endpoint, use the web link with prefix https://lapis.cov-spectrum.org/open/v1 and the suffix for the relevant endpoint. In the examples, we only show the suffixes to keep things simple, but you can click to try the full link in your browser.

Response Format

Response example:

{
  "info":{"apiVersion":1,"deprecationDate":null,"deprecationInfo":null},
  "errors":[],
  "data":[{"count":913515}]
}

The responses can be formatted in JSON or CSV. The default is JSON. To get CSV responses, append the query parameter dataFormat=csv.
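
As a small illustration, the following sketch appends dataFormat=csv to a query link. The helper name with_csv_format is hypothetical. The commented-out read.csv() call assumes that the CSV response starts with a header row, which is not stated here.

```r
# Append dataFormat=csv to a LAPIS query URL, using "?" if the URL has no
# query parameters yet and "&" otherwise.
# `with_csv_format` is a hypothetical helper name, not part of the API.
with_csv_format <- function(url) {
  separator <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
  paste0(url, separator, "dataFormat=csv")
}

base <- "https://lapis.cov-spectrum.org/open/v1"
csv_url <- with_csv_format(paste0(base, "/sample/aggregated?fields=country"))
print(csv_url)

# Assumption: the CSV response has a header row, so it could be loaded
# with base R (requires network access):
# data <- read.csv(csv_url)
```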

Responses returned in the JSON format have three top-level attributes:

"info" - data about the API itself
"errors" - an array (hopefully empty!) of things that went wrong
"data" - the actual data

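The three attributes suggest a simple pattern for client code: stop if "errors" is non-empty, warn if "info" announces a deprecation, and otherwise use "data". A minimal sketch follows; the helper name extract_data is hypothetical, and the input is assumed to be the list that jsonlite::fromJSON() produces for a JSON response. The example uses a hand-built stand-in for the response shown above, so it runs without network access.

```r
# Validate a parsed LAPIS response and return its data.
# `extract_data` is a hypothetical helper name, not part of LAPIS.
extract_data <- function(response) {
  if (length(response$errors) > 0) {
    stop("The API reported errors")
  }
  if (!is.null(response$info$deprecationDate)) {
    warning("This API version will be deprecated on ",
            response$info$deprecationDate)
  }
  response$data
}

# Hand-built stand-in for the parsed example response.
example_response <- list(
  info = list(apiVersion = 1, deprecationDate = NULL, deprecationInfo = NULL),
  errors = list(),
  data = data.frame(count = 913515)
)
print(extract_data(example_response)$count)
```

Wrapping the checks in one function avoids repeating them for every query.
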
Filters

Examples:

Get the number of all samples in Switzerland in 2021:
/sample/aggregated?country=Switzerland&dateFrom=2021-01-01&dateTo=2021-12-31

{
  "info":{"apiVersion":1,"deprecationDate":null,"deprecationInfo":null},
  "errors":[],
  "data":[{"count":22701}]
}

Get details about samples from lineage AY.1 in Geneva, Switzerland:
/sample/details?country=Switzerland&division=Geneva&pangoLineage=AY.1

{
  "info": {"apiVersion":1,"deprecationDate":null,"deprecationInfo":null},
  "errors": [],
  "data": [
    {
      "date": "2021-05-26",
      "dateSubmitted": "2021-06-29",
      "region": "Europe",
      "country": "Switzerland",
      "division": "Geneva",
      "location": null,
      "regionExposure": "Europe",
      "countryExposure": "Switzerland",
      "divisionExposure": "Geneva",
      "age": null,
      "sex": null,
      "host": "Homo sapiens",
      "samplingStrategy": null,
      "pangoLineage": "AY.1",
      "nextstrainClade": "21A (Delta)",
      "gisaidCloade": null,
      "submittingLab": null,
      "originatingLab": null,
      "genbankAccession": "OU268406",
      "sraAccession": null,
      "gisaidEpiIsl": "EPI_ISL_2405325",
      "strain":"Switzerland/GE-HUG-34284688/2021"
    },
    ...
  ]
}

Large queries, for example detailed information on all samples, can take a while. Instead, we can add filters to restrict the query to the samples of interest. The syntax for adding filters is <attribute1>=<valueA>&<attribute2>=<valueB>.
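
When queries are assembled in code, the filter syntax can be generated from a named list. The following is a sketch only: lapis_query is a hypothetical helper, not part of LAPIS, and it percent-encodes the values on the assumption that the server decodes them in the usual way.

```r
# Build a full query URL from an endpoint suffix and a named list of filters.
# `lapis_query` is a hypothetical helper, not part of LAPIS.
lapis_query <- function(endpoint, filters = list()) {
  base <- "https://lapis.cov-spectrum.org/open/v1"
  if (length(filters) == 0) {
    return(paste0(base, endpoint))
  }
  # Percent-encode values so that spaces and reserved characters are safe.
  values <- vapply(
    filters,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  pairs <- paste0(names(filters), "=", values)
  paste0(base, endpoint, "?", paste(pairs, collapse = "&"))
}

url <- lapis_query(
  "/sample/details",
  list(country = "Switzerland", division = "Geneva", pangoLineage = "AY.1")
)
print(url)
```

The resulting link is the same as the Geneva example, with the prefix included.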

All six sample endpoints can be filtered by the following attributes:

dateFrom
dateTo
dateSubmittedFrom
dateSubmittedTo
region
country
division
location
regionExposure
countryExposure
divisionExposure
ageFrom
ageTo
sex
host
samplingStrategy
pangoLineage (see section "Filter Pango Lineages")
nextcladePangoLineage
nextstrainClade
gisaidClade
submittingLab
originatingLab
nucMutations (see section "Filter Mutations")
aaMutations (see section "Filter Mutations")
nextcladeQcOverallScoreFrom
nextcladeQcOverallScoreTo
nextcladeQcMissingDataScoreFrom
nextcladeQcMissingDataScoreTo
nextcladeQcMixedSitesScoreFrom
nextcladeQcMixedSitesScoreTo
nextcladeQcPrivateMutationsScoreFrom
nextcladeQcPrivateMutationsScoreTo
nextcladeQcSnpClustersScoreFrom
nextcladeQcSnpClustersScoreTo
nextcladeQcFrameShiftsScoreFrom
nextcladeQcFrameShiftsScoreTo
nextcladeQcStopCodonsScoreFrom
nextcladeQcStopCodonsScoreTo

The endpoints details, aa-mutations, nuc-mutations, fasta, and fasta-aligned can additionally be filtered by these attributes:

genbankAccession
sraAccession
gisaidEpiIsl
strain

To determine which values are available for each attribute, see the example in section "Aggregation".

Filter Pango Lineages

Get the total number of samples of the lineage B.1.617.2 without sub-lineages:
/sample/aggregated?pangoLineage=B.1.617.2

Get the total number of samples of the lineage B.1.617.2 including sub-lineages:
/sample/aggregated?pangoLineage=B.1.617.2*

Pango lineage names inherit the hierarchical nature of genetic lineages. For example, B.1.1 is a sub-lineage of B.1. More information about the pango nomenclature can be found on the website of the Pango network.

With the pangoLineage filter, it is possible to not only filter for a very specific lineage but also to include its sub-lineages. To include sub-lineages, add a * at the end. For example, writing B.1.351 will only give samples of B.1.351. Writing B.1.351* or B.1.351.* (there is no difference between these two options) will return B.1.351, B.1.351.1, B.1.351.2, etc.

An official Pango lineage name can have at most three numeric components. A sub-lineage of a lineage with a maximal-length name (e.g., B.1.617.2) gets an alias. A list of aliases can be found here. B.1.617.2 has the alias AY, so AY.1 is a sub-lineage of B.1.617.2. LAPIS is aware of aliases: filtering for B.1.617.2* will include every lineage that starts with AY. It is also possible to search for B.1.617.2.1, which returns the same results as AY.1.

Filter Mutations

Get the total number of samples with the synonymous nucleotide mutations 913T and 5986T and the amino acid mutation S:484K:
/sample/aggregated?nucMutations=913T,5986T&aaMutations=S:484K

Get the total number of samples for which we do not know whether the S:501 position is mutated:
/sample/aggregated?aaMutations=S:501X

It is possible to filter for amino acid and nucleotide bases/mutations. Multiple mutations can be provided by specifying a comma-separated list.

A nucleotide mutation has the format <position><base>. A "base" can be one of the four nucleotides A, T, C, and G. It can also be - for deletion and N for unknown.

An amino acid mutation has the format <gene>:<position><base>. The following genes are available: E, M, N, ORF1a, ORF1b, ORF3a, ORF6, ORF7a, ORF7b, ORF8, ORF9b, S. A "base" can be one of the 20 amino acid codes. It can also be - for deletion and X for unknown.

The <base> can be omitted to filter for any mutation. You can write a . for the <base> to filter for sequences for which it is confirmed that no mutation occurred, i.e., sequences that have the same base as the reference genome at the specified position.
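
The amino acid format can be assembled and sanity-checked on the client before a request is sent. The sketch below is illustrative: aa_mutation is a hypothetical helper, and the check only reflects the rules described in this section (the listed genes, the 20 one-letter amino acid codes, - for deletion, X for unknown, . for the reference base, or no base at all).

```r
# Genes listed in this section.
genes <- c("E", "M", "N", "ORF1a", "ORF1b", "ORF3a", "ORF6",
           "ORF7a", "ORF7b", "ORF8", "ORF9b", "S")

# Format one amino acid mutation as <gene>:<position><base>.
# `base` may be "" (any mutation), "." (same as reference), "-" (deletion),
# "X" (unknown), or a one-letter amino acid code.
# `aa_mutation` is a hypothetical helper, not part of LAPIS.
aa_mutation <- function(gene, position, base = "") {
  stopifnot(gene %in% genes, position >= 1)
  stopifnot(base == "" || grepl("^[ACDEFGHIKLMNPQRSTVWYX.-]$", base))
  paste0(gene, ":", as.integer(position), base)
}

# Multiple mutations are joined into a comma-separated list.
mutations <- c(aa_mutation("S", 484, "K"), aa_mutation("S", 501))
param <- paste0("aaMutations=", paste(mutations, collapse = ","))
print(param)
```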

Aggregation

Examples:

Get the number of B.1.1.7 samples per country:
/sample/aggregated?fields=country&pangoLineage=B.1.1.7

{
  "info": {"apiVersion":1,"deprecationDate":null,"deprecationInfo":null},
  "errors": [],
  "data": [
    {"country": "Austria", "count": 82},
    {"country": "Bahrain", "count": 48},
    ...
  ]
}

Get the number of samples per Nextstrain clade and country:
/sample/aggregated?fields=nextstrainClade,country

{
  "info": {"apiVersion":1,"deprecationDate":null,"deprecationInfo":null},
  "errors": [],
  "data": [
    {"nextstrainClade": "19A", "country": "Australia", "count": 317},
    {"nextstrainClade": "19A", "country": "Bahrain", "count": 2},
    ...
  ]
}

Get all the possible values for attribute "division" in Switzerland:
/sample/aggregated?fields=division&country=Switzerland

{
  "info": {"apiVersion":1,"deprecationDate":null,"deprecationInfo":null},
  "errors": [],
  "data": [
    {"division": "Basel-Land", "count": 4658},
    {"division": "Aargau", "count": 2964},
    ...
  ]
}

Above, we used the /sample/aggregated endpoint to get the total counts of sequences with or without filters. Using the query parameter fields, we can group the samples and get the counts per group. For example, we can use it to get the number of samples per country. We can also use it to list the available values for each attribute.

fields accepts a comma-separated list. The following values are available:

date
dateSubmitted
region
country
division
location
regionExposure
countryExposure
divisionExposure
age
sex
host
samplingStrategy
pangoLineage
nextcladePangoLineage
nextstrainClade
gisaidClade
submittingLab
originatingLab
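
Counts grouped by several fields can be rolled up further on the client. As a small illustration, the sketch below takes only the two rows shown in the example response for fields=nextstrainClade,country (a real response contains many more) and sums them per clade with base R's aggregate().

```r
# The two rows shown in the example response for
# fields=nextstrainClade,country.
data <- data.frame(
  nextstrainClade = c("19A", "19A"),
  country = c("Australia", "Bahrain"),
  count = c(317, 2)
)

# Sum the counts over countries to get one total per clade.
per_clade <- aggregate(count ~ nextstrainClade, data = data, FUN = sum)
print(per_clade)
```

For these two rows, the result is a single row for clade 19A with a count of 319.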

Use Cases

We demonstrate two use cases for this API in R.

Plot the global distribution of all sequences

library(jsonlite)
library(ggplot2)

# Query the API
response <- fromJSON("https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=region")

# Check for errors
errors <- response$errors
if (length(errors) > 0) {
  stop("Errors")
}

# Check for deprecation
deprecationDate <- response$info$deprecationDate
if (!is.null(deprecationDate)) {
  warning(paste0("This version of the API will be deprecated on ", deprecationDate,
                 ". Message: ", response$info$deprecationInfo))
}

# The data is good to be used!
data <- response$data

# Make a plot
ggplot(
  data,
  aes(x = "", y = count, fill = region)) + 
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y", start = 0) + 
  theme_minimal() + 
  theme(
    panel.grid=element_blank(),
    panel.border = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(), 
    axis.title.y = element_blank(),
    axis.text.x = element_blank())

Steps:

Query data from the API
Check whether there are errors. If yes, abort!
Check whether a deprecation date is given. If yes, write a warning.
Parse data from JSON as a data frame.
Use the data frame to create a plot.

Plot the count of delta samples in a country in the past 100 days

library(jsonlite)
library(ggplot2)

# Query the API
date_from <- format(Sys.Date() - as.difftime(100, units = "days"), "%Y-%m-%d")
query <- paste0(
  "https://lapis.cov-spectrum.org/open/v1/sample/aggregated?",
  "fields=date",
  "&country=Switzerland",
  "&dateFrom=", date_from,
  "&pangoLineage=B.1.617.2*"
)
response <- fromJSON(query)

# Check for errors
errors <- response$errors
if (length(errors) > 0) {
  stop("Errors")
}

# Check for deprecation
deprecationDate <- response$info$deprecationDate
if (!is.null(deprecationDate)) {
  warning(paste0("This version of the API will be deprecated on ", deprecationDate,
                 ". Message: ", response$info$deprecationInfo))
}

# The data is good to be used!
data <- response$data

# Make a plot
ggplot(
  data,
  aes(x = as.Date(date), y = count)) + 
  geom_col() + 
  theme_bw() + 
  labs(x = NULL, y = "Count") +
  scale_x_date(date_breaks = "1 month", date_labels = "%B %Y") + 
  ggtitle("Count of delta samples in Switzerland in the past 100 days")