Google Dataset Search: Out of Beta

Google Dataset Search was launched in September 2018 with the goal of creating a searchable public data repository. The search engine indexes data repositories on the Web based on their metadata and, to date, includes millions of datasets from a variety of sources. It is built on schema.org (https://schema.org/), an open standard for organizing metadata. Anyone can contribute datasets to this engine, but they must follow the schema.org guidelines. Further details regarding contributing data can be found here.

Below is a diagram of how the dataset search engine works. Using schema.org standards, structured information is embedded in a page's HTML without affecting the appearance of the page. Further details on the technology behind the search engine can be found in this 2018 Google AI Blog post.

An overview of the technology behind Google Dataset Search
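
To make the idea of embedding structured information into HTML more concrete, here is a minimal sketch of how a publisher might generate a schema.org Dataset description as JSON-LD and wrap it in a script tag. The helper names (dataset_jsonld, as_script_tag), the example.org URL, and the sample dataset are hypothetical and used only for illustration. The properties shown (name, description, url, keywords) are standard schema.org Dataset properties, but the schema.org guidelines remain the authority on what a real submission needs.

```python
import json


def dataset_jsonld(name, description, url, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
    }


def as_script_tag(metadata):
    """Wrap the metadata in a script tag that browsers do not render."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset, not a real record in Google Dataset Search.
example = dataset_jsonld(
    name="Example hospital readmissions dataset",
    description="Hypothetical dataset used only to illustrate the markup.",
    url="https://example.org/datasets/readmissions",
    keywords=["hospital readmissions", "example"],
)

print(as_script_tag(example))
```

The resulting tag can be placed in the HTML of the page that describes the dataset. Because the metadata lives in a script element, it is readable by crawlers without changing what visitors see.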


In late January 2020, the dataset search engine came out of beta with many new features, particularly filters. The home page is simple, with just a search box. Typing in, for example, “hospital readmissions” displays several suggestions in a drop-down list. One of those suggestions can be selected, or selecting the magnifying glass icon runs a full dataset search. Entering “Hospital Readmissions” produces the following screenshot.