Next door to Delhi, a ‘bank’ to store country’s digitised biological data
The ‘Indian Biological Data Bank’ has come up at the Regional Centre for Biotechnology in Faridabad. The digitised data will be stored on a four-petabyte supercomputer called ‘Brahm’. A petabyte equals 10,00,000 gigabytes (GB).

The government has for the first time set up a digitised repository where Indian researchers will store biological data from publicly funded research, reducing their dependency on American and European data banks.
The government has mandated that data from all publicly funded research should be stored in this central repository. It will thus not only give researchers a platform to store their data securely within the country, but also provide access to a large database of indigenous sequences for analysis.
Such databases have traditionally played a key role in determining the genetic basis of various diseases and finding targets for vaccines and therapeutics.
“At present, most Indian researchers depend on the European Molecular Biology Laboratory (EMBL) and National Center for Biotechnology Information databases for storing the biological data. There are other smaller datasets available with some institutes, but those are not accessible to all. This will be the first national data repository, where the data will not only be submitted from across India but can be accessed by researchers from across India,” said Dr Sudhanshu Vrati, director of the Regional Centre for Biotechnology.
At the inauguration of the centre on Thursday, Union Science Minister Jitendra Singh said the bio-bank will create “Indian data for Indian solutions”.
“Many of our researchers still depend on other countries for such large databases, but the Indian phenotype is very different and solutions based on others’ data might not be optimal. We also need to look beyond. We can even provide our data to Western countries. You go to any of our public hospitals and you can find a patient with any disease you want to study; Western countries hardly see cases of tuberculosis or many other tropical diseases,” said Singh.
The bio-bank, which cost about Rs 85 crore to set up, currently accepts nucleotide sequences: the digitised genetic makeup of humans, plants, animals, and microbes. It now holds 200 billion base pairs of data, including 200 human genomes sequenced under the ‘1,000 Genome Project’, an international effort to map genetic variation in people. The project will also focus on populations that are predisposed to certain diseases.
The database also contains most of the 2.6 lakh Sars-CoV-2 genomes sequenced by the Indian Sars-CoV-2 Genomics Consortium (INSACOG). These sequences, which are also uploaded to a global database, have helped the consortium keep track of Sars-CoV-2 variants circulating in the country and warn authorities about any emerging variant that might lead to more cases. For instance, the government learnt from this data that the Omicron sub-variant BA.2.75 was being overtaken by a recombinant variant, XBB, which is a combination of two Omicron sub-lineages, BJ.1 and BA.2.75.
Other than human and Sars-CoV-2 genomes, the database will also store the 25,000 Mycobacterium tuberculosis genomes that another national consortium is working to sequence. This will help not only in understanding the spread of multi-drug-resistant and extensively drug-resistant TB in the country, but also in the search for targets for new therapies and vaccines.
The database also currently stores the genomic sequences of crops such as rice, onion, tomato and mustard, among others. With genomes of humans, animals, and microbes present in the same database, it will also help researchers study zoonotic diseases, that is, diseases that jump from animals to humans.
Department of Biotechnology Secretary Dr Rajesh Gokhale said: “Take for example the BRCA gene that we know is associated with breast cancer. If we have a geographically representative database, we can actually determine the prevalence of breast cancer risk in the country by region. Or, if we compare our genes with sequences available from other parts of the world, we may detect mutations that are present only in our population.”
Although the database currently accepts only such genomic sequences, it is likely to expand later to store protein sequences (strings of amino acids that join together to form the various proteins found in these organisms) and imaging data such as copies of ultrasound and MRI scans.
The database currently offers researchers two mechanisms for data submission: open access, where uploaded data can be used immediately by other researchers from across the country; and controlled access, where the data will not be openly shared for a number of years before being opened up to all.
The bio-bank also has a backup ‘Disaster Recovery’ site at the National Informatics Centre (NIC) in Bhubaneshwar.
“We are thinking of providing controlled access for a six-year period; the government has to take a call on that. During this period the data will be stored on our servers but be accessible to only the researchers who have uploaded it. After the period, it will be made openly available to others,” said Dr Vrati. The data will also be tagged with an accession number that will make it searchable not only in the Indian database but also in international databases.