Google Dataset Search: Out of Beta

Google Dataset Search was launched in September 2018 with the goal of creating a searchable public data repository. The search engine indexes data repositories on the Web based on their metadata and, to date, includes millions of datasets from a variety of sources. It relies on schema.org (https://schema.org/), an open standard for organizing metadata. Anyone can contribute datasets to the engine, but they must follow the schema.org guidelines. Further details regarding contributing data can be found here.

The diagram below shows how the dataset search engine works. Using schema.org standards, publishers embed structured information in the HTML of a page without affecting its appearance. Further details on the technology behind the search engine can be found in this 2018 Google AI Blog post.

An overview of the technology behind Google Dataset Search
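To make the idea of embedded metadata concrete, here is a minimal Python sketch that builds a schema.org Dataset description as JSON-LD and wraps it in a script tag, which browsers do not render. The property names (name, description, url, license, distribution, encodingFormat, contentUrl) come from the schema.org vocabulary; the function names, dataset title, and example.org URLs are made up for illustration and are not taken from Google's documentation.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url, encoding_format):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


def embed_in_html(metadata):
    """Wrap the metadata in a script tag that does not change how the page looks."""
    payload = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# All values below are hypothetical placeholders.
metadata = dataset_jsonld(
    name="Hospital Readmissions (example)",
    description="Illustrative record only; not a real dataset.",
    url="https://example.org/datasets/readmissions",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    download_url="https://example.org/datasets/readmissions.csv",
    encoding_format="text/csv",
)
snippet = embed_in_html(metadata)
print(snippet)
```

The printed snippet would be pasted into the HTML of the page that describes the dataset, where a crawler can read it while visitors see no change.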


In late January 2020, the dataset search engine came out of beta with many new features, particularly filters. The home page is simple, with just a search box. Typing a query such as “hospital readmissions” displays several suggestions in a drop-down list. You can select one of those suggestions, or select the magnifying glass icon to run a full dataset search. Entering “Hospital Readmissions” produces the results page shown in the following screenshot.

The upper menu consists of the following choices: Updated Date, Download Format, Usage Rights and Free. The following table provides details of each menu function.



A list of matching datasets appears on the left along with a result count, and the selected dataset appears on the right. The selected dataset begins with hyperlinks at the top for easy access. The body of the entry consists of the following sections: (1) Dataset updated, (2) Dataset provided by, (3) License status, (4) Available download formats, and (5) Description.

Some entries also list pertinent articles (viewable in Google Scholar) that cite the dataset.

The search term “hospital readmissions” returned 100+ results. Restricting the download format to table reduced this to 43 results. Note that you can filter by general categories such as table and image, but not by specific file types such as .csv or .jpeg. The user must then review the returned datasets to find one with the required file type.
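Until a file-type filter exists, one workaround is to filter on your own side once you have gathered the metadata for the datasets of interest. The sketch below assumes you have assembled a list of records by hand in the schema.org shape (a "distribution" list with "encodingFormat" and "contentUrl"); the records, the helper name has_format, and the example.org URLs are hypothetical, and nothing here calls Google Dataset Search itself.

```python
def has_format(record, wanted):
    """Return True if any distribution of the record matches the wanted file type."""
    wanted = wanted.lower().lstrip(".")
    for dist in record.get("distribution", []):
        fmt = dist.get("encodingFormat", "").lower()
        url = dist.get("contentUrl", "").lower()
        # Match a bare label ("csv"), a text MIME type ("text/csv"),
        # or a download URL that ends with the extension.
        if fmt in (wanted, "text/" + wanted) or url.endswith("." + wanted):
            return True
    return False


# Hypothetical, hand-assembled records for illustration only.
records = [
    {
        "name": "Example dataset A",
        "distribution": [
            {"encodingFormat": "CSV", "contentUrl": "https://example.org/a.csv"}
        ],
    },
    {
        "name": "Example dataset B",
        "distribution": [
            {"encodingFormat": "application/json", "contentUrl": "https://example.org/b.json"}
        ],
    },
    {
        "name": "Example dataset C",
        "distribution": [
            {"encodingFormat": "text/csv", "contentUrl": "https://example.org/download?id=3"}
        ],
    },
]

csv_only = [r["name"] for r in records if has_format(r, "csv")]
print(csv_only)
```

Publishers label formats inconsistently, so the check looks at both the declared format and the download URL; a real workflow would likely need a longer list of aliases.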

Most datasets come from government organizations in the United States and other countries. Datasets are also hosted on commercial sites such as data.world, John Snow Labs, and KDnuggets, on fee-based sites such as Statista, and in online repositories such as Figshare and the PLOS Figshare portal.

A “share” icon in the upper right permits sharing via Facebook, Twitter, and email as well as the ability to copy and share the page link.

Google Dataset Search is also now configured for mobile access.

Overall, this new search engine should save time for the average researcher and student. Google intends to add more datasets and functionality down the road. More detailed filters, such as the ability to search only for datasets in a specific file type, would be an asset.


Posted on Medium.com January 30, 2020

©2019 by Introduction to Biomedical Data Science.