Author response: Sharing neurophysiology data from the Allen Brain Observatory

Saskia de Vries, Joshua H. Siegle, Christof Koch

2023

Abstract

Nullius in verba (‘trust no one’), chosen as the motto of the Royal Society in 1660, implies that independently verifiable observations, rather than authoritative claims, are a defining feature of empirical science. As the complexity of modern scientific instrumentation has made exact replications prohibitive, sharing data is now essential for ensuring the trustworthiness of one’s findings. While embraced in spirit by many, in practice open data sharing remains the exception in contemporary systems neuroscience. Here, we take stock of the Allen Brain Observatory, an effort to share data and metadata associated with surveys of neuronal activity in the visual system of laboratory mice. Data from these surveys have been used to produce new discoveries, to validate computational algorithms, and as a benchmark for comparison with other data, resulting in over 100 publications and preprints to date. We distill some of the lessons learned about open surveys and data reuse, including remaining barriers to data sharing and what might be done to address these.

Editor's evaluation

This article presents an important review of data-sharing efforts in neurophysiology, with a focus on data released by the Allen Institute for Brain Science. The article offers perspectives from the users of such shared data, and makes a compelling case that data sharing has already advanced research in neuroscience. There are valuable insights here for producers and users of neurophysiology data, as well as the funders that support all those efforts.

https://doi.org/10.7554/eLife.85550.sa0

Introduction

Why share data?
The central nervous system is among the most complex organs under investigation. Accordingly, the tools to study it have become intricate and costly, generating ever-growing torrents of data that need to be ingested, quality-controlled, and curated for subsequent analysis. Not every lab has the financial or personnel resources to accomplish this. Moreover, while many scientists relish running experiments, others find their passion in analysis. Data collection requires a different skillset than analysis, especially as the field demands more comprehensive and higher-dimensional datasets, which, in turn, necessitate more advanced analytical methods and software infrastructure. A scientific ecosystem in which data is extensively shared and reused would give researchers more freedom to focus on their favorite parts of the discovery process. Sharing data brings other benefits as well. It increases the number of eyes on each dataset, making it easier to spot potential outlier effects (Button et al., 2013). It encourages meta-analyses that integrate data from multiple studies, providing the opportunity to reconcile apparently contradicting results or expose the biases inherent in specific analysis pipelines (Botvinik-Nezer et al., 2020; Mesa et al., 2021). It also gives researchers a chance to test hypotheses on existing data, refining and updating their ideas before embarking on the more costly process of running new experiments. Without a doubt, reanalysis of neurophysiology data has already facilitated numerous advances. Electrophysiological recordings from nonhuman primates, which require tremendous dedication to collect, are often reused in multiple high-impact publications (Churchland et al., 2010; Murray et al., 2014). 
Data from ‘calibration’ experiments, in which activity of individual neurons is monitored via two modalities at once, have been extremely valuable for improving data processing algorithms (GENIE Project, 2015; Henze et al., 2009; Huang et al., 2021; Neto et al., 2016). A number of these datasets have been shared via the website of CRCNS (Teeters et al., 2008), a far-sighted organization focused on aggregating data for computational neuroscience within the same searchable database. To date, CRCNS hosts 150 datasets, including extensive neurophysiology recordings from a variety of species, as well as fMRI, EEG, and eye movement datasets. This is especially impressive given that CRCNS was launched by a single lab in 2008. The repository does not enforce formatting standards, and thus each dataset differs in its packaging conventions, as well as what level of preprocessing may have been applied to the data. The website includes a list of 111 publications and preprints based on CRCNS data. Our own meta-analysis of these articles shows that 28 out of 150 datasets have been reused at least once, with four reused more than 10 times each. More recently, an increasing number of researchers are choosing to make data public via generalist repositories such as Figshare, Dryad, and Zenodo, or the neuroscience-specific G-Node Infrastructure. In addition, the lab of György Buzsáki maintains a databank of recordings from more than 1000 sessions from freely moving rodents (Petersen et al., 2020). As data can be hosted on these repositories for free, they greatly lower the barriers to sharing. However, the same features that reduce the barriers for sharing can also increase the barriers for reuse. With no restrictions on the data format or level of documentation, learning how to analyze diverse open datasets can take substantial effort, and scientists are limited in their ability to perform meta-analyses across datasets. 
Further, with limited and nonstandard documentation, finding relevant datasets can be challenging. Since its founding, the Allen Institute has made open data one of its core principles. Specifically, it has become known for generating and sharing survey datasets within the field of neuroscience, taking inspiration from domains such as astronomy where such surveys are common. (As a community, astronomers have developed a far more comprehensive and coherent data infrastructure than biology. One obvious reason is the existence of a single sky with an agreed-upon coordinate system and associated standards such as the Flexible Image Transport System; Borgman et al., 2016; York et al., 2000; Zuiderwijk and Spiers, 2019.) The original Allen Mouse Brain Atlas (Lein et al., 2007) and subsequent surveys of gene expression (Bakken et al., 2016; Hawrylycz et al., 2012; Miller et al., 2014), mesoscale connectivity (Harris et al., 2019; Oh et al., 2014), and in vitro firing patterns (Gouwens et al., 2019) have become essential resources across the field. These survey datasets (1) are collected in a highly standardized manner with stringent quality controls, (2) contain a volume of data that is much larger than typical individual studies within their particular disciplines, and (3) are collected without a specific hypothesis to facilitate a diverse range of use cases. Starting a decade ago, we began planning the first surveys of in vivo physiology in mouse cortex with single-cell resolution (Koch and Reid, 2012). Whereas gene expression and connectivity are expected to change relatively slowly, neural responses in awake subjects can vary dramatically from moment to moment, even during apparently quiescent periods (McCormick et al., 2020). Therefore, an in vivo survey of neural activity poses new challenges, requiring many trials and sessions to account for both intra- as well as inter-subject variability. 
We first used two-photon calcium imaging and later Neuropixels electrophysiology to record spontaneous and evoked activity in visual cortex and thalamus of awake mice that were passively exposed to a wide range of visual stimuli (known as ‘Visual Coding’ experiments). A large number of subjects, highly standardized procedures, and rigorous quality control criteria distinguished these surveys from typical small-scale neurophysiology studies. More recently, the Institute carried out surveys of single-cell activity in mice performing a visually guided behavioral task (known as ‘Visual Behavior’ experiments). In all cases, the data was shared even before we published our own analyses of it. We reflect here on the lessons learned concerning the challenges of data sharing and reuse in the neurophysiology space. Our primary takeaway is that the widespread mining of our publicly available resources demonstrates a clear community demand for open neurophysiology data and points to a future in which data reuse becomes more commonplace. However, more work is needed to make data sharing and reuse practical (and ideally the default) for all laboratories practicing systems neuroscience.

Overview of the Allen Brain Observatory

The Allen Brain Observatory consists of a set of standardized instruments and protocols designed to carry out surveys of cellular-scale neurophysiology in awake brains (de Vries et al., 2020; Siegle et al., 2021a). Our initial focus was on neuronal activity in the mouse visual cortex (Koch and Reid, 2012). Vision is the most widely studied sensory modality in mammals, but much of the foundational work is based on recordings with hand-tuned stimuli optimized for individual neurons, typically investigating a single area at a time (Hubel and Wiesel, 1998). The field has lacked the sort of unbiased, large-scale surveys required to rigorously test theoretical models of visual function (Olshausen and Field, 2005). 
The laboratory mouse is an advantageous model animal given the extensive ongoing work on mouse cell types (BRAIN Initiative Cell Census Network (BICCN), 2021; Tasic et al., 2018; Yao et al., 2021; Zeisel et al., 2015), as well as access to a well-established suite of genetic tools for observing and manipulating neural activity via driver and reporter lines or viruses (Gerfen et al., 2013; Madisen et al., 2015). Our two-photon calcium imaging dataset (Allen Institute Mindscope Program, 2016) leveraged transgenic lines to drive the expression of a genetically encoded calcium indicator (Chen et al., 2013) in specific populations of excitatory neurons (often constrained to a specific cortical layer) or GABAergic interneurons. In total, we recorded activity from over 63,000 neurons across 6 cortical areas, 4 cortical layers, and 14 transgenic lines (Figure 1). The Neuropixels electrophysiology dataset (Allen Institute MindScope Program, 2019) used silicon probes (Jun et al., 2017) to record simultaneously from the same six cortical areas targeted in the two-photon dataset, as well as additional subcortical regions (Durand et al., 2022). While cell type specificity was largely lost, transgenic lines did enable optotagging of specific inhibitory interneurons. The Neuropixels dataset included recordings from over 40,000 units passing quality control across more than 14 brain regions and 4 mouse lines (Figure 1). In both surveys, mice were passively exposed to a range of visual stimuli. These included drifting and flashed sinusoidal gratings to measure traditional spatial and temporal tuning properties, sparse noise or windowed gratings to map spatial receptive fields, images and movies that have natural spatial and temporal statistics, and epochs of mean luminance to capture neurons’ spontaneous activity. 
These stimuli were selected to provide a broad survey of visual physiological activity and compare the organization of visual responses across brain regions and cell types. Mice were awake during these experiments and head-fixed on a spinning disk that permitted them to run in a self-initiated and unguided manner. Subsequent surveys of neural activity in mice performing a behavioral task are not discussed here as it is too soon to begin evaluating their impact on the field.

Figure 1. Overview of Allen Brain Observatory Visual Coding datasets. (a) Target brain regions and example visual stimuli. (b) Standardized rigs for two-photon calcium imaging (left) and Neuropixels electrophysiology (right). (c) Example ΔF/F traces or spike rasters for 100 simultaneously recorded neurons from each modality. Both are extracted during one presentation of a 30 s natural movie. (d) Dataset size after different stages of analysis.

Figure 1, source data 1. Dataset size after each stage of analysis. https://cdn.elifesciences.org/articles/85550/elife-85550-fig1-data1-v1.xlsx

Approach to data distribution

Once the data was collected, we wanted to minimize the friction required for external groups to access it and mine it for insights. This is challenging! Providing unfettered access to the data can be accomplished by providing a simple download link; yet, unless the user understands what is contained in the file and has installed the appropriate libraries for parsing the data, its usefulness is limited. At the other extreme, a web-based analysis interface that does not require any downloading or installation can facilitate easy data exploration, but this approach has high upfront development costs and imposes limitations on the analyses that can be carried out. 
These conflicting demands are apparent in our custom tool, the AllenSDK, a Python package that serves as the primary interface for downloading data from these surveys as well as other Allen Institute resources. In the case of the Allen Brain Observatory, the AllenSDK provides wrapper functions for interacting with the Neurodata Without Borders (NWB) files (Rübel et al., 2022; Teeters et al., 2015) in which the data is stored. Intuitive functions enable users to search metadata for specific experimental sessions and extract the relevant data assets. Whereas our two-photon calcium imaging survey was accompanied by a dedicated web interface that displayed summary plots for every cell and experiment (observatory.brain-map.org/visualcoding), we discontinued this practice because of its associated development costs and because most users preferred to directly access the data in their own analysis environment. One challenge with sharing cellular neurophysiology data is that it includes multiple high-dimensional data streams. Many other data modalities (e.g., gene expression) can be reduced to a derived metric and easily shared in a tabular format (e.g., cell-by-gene table). In contrast, neurophysiological data is highly varied, with researchers taking different approaches to both data processing (e.g., spike sorting or cell segmentation) and analysis. While these data can be analyzed as a large collection of single-cell recordings, they can also be approached as population recordings, leveraging the fact that hundreds to thousands of neurons are recorded simultaneously. Thus, particularly for a survey-style dataset not designed to test a particular hypothesis, it is hard to reduce these recordings to a simple set of derived metrics that encapsulate the full range of neural and behavioral states. 
Even when it is possible (e.g., we could have shared a table of single-cell receptive field and tuning properties as the end product), this confines any downstream analyses to those specific metrics, severely narrowing the space of possible use cases. At the same time, if we had only shared the raw data, few researchers would have the resources or the inclination to build their own preprocessing and packaging pipelines. Therefore, we aimed to share our data in a flexible way to facilitate diverse use cases. For every session, we provided either spike times or fluorescence traces, temporally aligned stimulus information, the mouse’s running speed and pupil tracking data, as well as intermediate, derived data constructs, such as ROI masks, neuropil traces, and pre- and post-de-mixing traces for two-photon microscopy, and waveforms across channels for Neuropixels. All are contained within the NWB files. In addition, we uploaded the more cumbersome, terabyte-scale raw imaging movies and voltage traces to the public cloud for users focused on data processing algorithms (Figure 2).

Figure 2. Distributing data from Allen Brain Observatory Visual Coding experiments. Raw data is acquired and processed at the Allen Institute, combined with metadata (including 3-D neuronal coordinates, stimulus information, eye-tracking data, and running speed) and packaged into NWB files. Each such file is intended to be a complete, self-contained data record for one experimental session in one animal. NWB files are uploaded to three different locations in the cloud: the Amazon Web Services (AWS) Registry of Open Data, the Distributed Archives for Neurophysiology Data Integration (DANDI) repository, and the Allen Institute data warehouse (accessed via the AllenSDK, a Python API for searching for relevant sessions and downloading data). Raw data is also uploaded to the AWS Registry of Open Data. 
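As an illustration of why sharing spike times together with temporally aligned stimulus information is useful, the following minimal sketch counts the spikes of one unit in a fixed window after each stimulus onset and converts the counts to a mean evoked firing rate. It does not call the AllenSDK or read NWB files; the function names, the window length, and the example values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def count_spikes_per_trial(spike_times, stimulus_onsets, window):
    """Count spikes in [onset, onset + window) for each stimulus presentation.

    spike_times must be sorted in ascending order; all values are in seconds.
    """
    counts = []
    for onset in stimulus_onsets:
        # Index of the first spike at or after the stimulus onset.
        start = bisect_left(spike_times, onset)
        # Index of the first spike at or after the end of the window.
        stop = bisect_left(spike_times, onset + window)
        counts.append(stop - start)
    return counts


def mean_evoked_rate(spike_times, stimulus_onsets, window):
    """Mean firing rate (spikes per second) across all presentations."""
    counts = count_spikes_per_trial(spike_times, stimulus_onsets, window)
    return sum(counts) / (len(counts) * window)


# Made-up values, not taken from any Allen Brain Observatory session.
example_spike_times = [0.10, 0.52, 0.55, 0.61, 1.20, 1.53, 1.58, 2.40]
example_onsets = [0.5, 1.5]
example_counts = count_spikes_per_trial(example_spike_times, example_onsets, 0.25)
example_rate = mean_evoked_rate(example_spike_times, example_onsets, 0.25)
```

Because the per-trial counts are kept rather than only a summary value, the same intermediate result could feed a tuning curve, a trial-to-trial variability estimate, or a population analysis, which is the kind of flexibility a table of precomputed metrics would not offer.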
End users can either analyze data in the cloud or download data for local analysis.

Three families of use cases

The first round of two-photon calcium imaging data was released in July 2016, followed by three subsequent releases that expanded the dataset (green triangles in Figure 3). The Neuropixels dataset became available in October 2019 (yellow triangle in Figure 3). At the end of 2022, there were 104 publications or preprints that reuse these two datasets, with first authors at 50 unique institutions. This demonstrates the broad appeal of applying a survey-style approach to the domain of in vivo neurophysiology.

Figure 3. Data reuse over time. Cumulative number of papers or preprints that include novel analysis of Allen Brain Observatory Visual Coding surveys. Triangles indicate the years in which new data was made publicly available. Paper icons indicate the years in which the Allen Institute preprints describing the dataset contents and initial scientific findings were posted.

We found three general use cases of Allen Brain Observatory data in the research community: novel about brain function new computational models and algorithms with experiments the Allen Institute we some of these three use cases, for both the two-photon calcium imaging and Neuropixels datasets. All these studies were carried out by groups external to the Allen Institute, and without any from to the with which data can be and and used Allen Brain Observatory two-photon imaging data to the of neural responses over time. 
found that neurons in a model with high inherent had more in their stimulus than those with also found that neurons with high inherent have population To these were they here analyzed fluorescence traces from the Allen Brain Observatory to population and response were The authors found population is with the change in and tuning of neurons over the of a single an population activity with individual neural et al., a neural could model both the and of the visual system in a single with a single two one with a single and the other with two a Coding the of these with the neural responses in the two-photon imaging dataset, they found that the single but to capture the of the The that the This work is an of how large-scale data can the development of neural how those approaches can our of cortical et al., analyzed the time of in neurons in the Neuropixels dataset and that a single presentation of a drifting or in a specific to a in the response to the same visual stimulus to trials in the This and is in all six visual cortical areas, but not in visual areas and which to after one or two This is a example of a discovery that was not when our but for which our stimulus set was well At least three publications have of the fact that every Neuropixels visual cortex and thalamus also the and et al., analyzed the local field potential from these to the of to information out of the and found that with a increase in connectivity with the et al., the of this connectivity and found that but of visual cortex neurons were by in while others were more to in and analyzed the responses of neurons to natural movies and found that many displayed highly that were often as as those of neurons in visual However, in to visual the in the if the were the learned temporal the Allen Brain Observatory experiments were not designed to test hypotheses of the Neuropixels dataset out to be for the this and visual cortical and models and algorithms Many researchers used the numerous and diverse 
fluorescence movies in the two-photon imaging dataset to validate processing As the different transgenic lines used in the dataset different populations of neurons, they have different As a there are some sparse movies with only a neurons within the field of and others with to This makes the dataset a for methods for cell et al., 2021; et al., 2021; et al., 2020; et al., 2018; et al., neurons across multiple sessions et al., and in the fluorescence traces et al., 2022; et al., 2022). et al., used the Neuropixels survey to a novel for in neural activity. of a cell is to without the need to such as spike As an they analyze the of the Neuropixels experiments carried out in the of with of to in the of genetically cell types at the end of each session, the authors how these recordings can be to test the impact of a particular of interneurons. not only neurons that are directly by the but also cortical neurons that are on and over et al., used raw data from the Neuropixels survey to validate a Python package that multiple spike sorting algorithms in and their We spike sorting with one such 2 et al., 2016). The authors of this used to compare the of 2 and additional In one example session, over 1000 units were by only one while only units were by or more At first this finding to indicate a high level of among the However, when these results with those from it became clear that the units were while the units were highly across This and the package in be essential for improving the of spike sorting in the with other datasets et al., used and learning algorithms to cortical visual areas based on either spontaneous activity or visually evoked visual areas, based on are to visual processing than compare tuning properties of neurons across the areas, as many studies (including our have the authors to the area and from the neural responses to visual stimuli. 
the of these algorithms for their own imaging dataset with our two-photon imaging This provides an and of their results to in which single-cell responses are available. and recordings in mouse cortex to where neurons to in visual and from The authors that these responses from visual features than the that these responses might be by tuning to temporal The authors use our two-photon imaging dataset to a in temporal tuning across cortical layers, with neurons in to lower the fact that responses are in While this use case is one of the it is an of for that from one’s own experiments. et al., activity from the Neuropixels dataset to fluorescence recorded in their analysis focused on the with which the of gratings can be from activity in visual their own two-photon calcium imaging dataset that of to simultaneously recorded neurons, they found that it was possible to use neural activity to that by than about a of 100 than behavioral in mice. As an important they that the in evoked responses to gratings was their two-photon data and our Neuropixels electrophysiology data, that their was not to on the modality. This use case is because the this comparison than a after our dataset became publicly available. et al., directly the Allen Neuropixels dataset with Neuropixels recordings from and carried out first analyzed these two in the Allen Brain Observatory dataset and found in support of their hypothesis that is by This with the hypothesis 2015), which that is for information a set of Neuropixels recordings in which they found that cortex of did not change the of and that is from their This is an example of how a survey dataset can be used to test a hypothesis, followed by a set of more specific experiments that the initial findings. 
in These surveys have also been used in a variety of Many computational neuroscience have them as potential of This includes the Allen own on the Brain as well as the Data and Vision and at the and the Brain in some cases these have to publications and 2022; et al., 2022). these these datasets are discussed in