Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Google Patent Phrase Similarity Dataset

Kaggle: “This is a human rated contextual phrase to phrase matching dataset focused on technical terms from patents. In addition to similarity scores that are typically included in other benchmark datasets we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, domain related. The dataset was used in the U.S. Patent Phrase to Phrase Matching competition. The dataset was generated with focus on the following:

  • Phrase disambiguation: certain keywords and phrases can have multiple different meanings. For example, the phrase “mouse” may refer to an animal or a computer input device. To help disambiguate the phrases we have included Cooperative Patent Classification (CPC) classes with each pair of phrases.
  • Adversarial keyword match: there are phrases that have matching keywords but are otherwise unrelated (e.g. “container section” → “kitchen container”, “offset table” → “table fan”). Many models will not do well on such data (e.g. bag of words models). Our dataset is designed to include many such examples.
  • Hard negatives: We created our dataset with the aim to improve upon current state of the art language models. Specifically, we have used the BERT model to generate some of the target phrases. So our dataset contains many human rated examples of phrase pairs that BERT may identify as very similar but in fact they may not be.

Each entry of the dataset contains two phrases – anchor and target, a context CPC class, a rating class, and a similarity score…”
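
To make the "adversarial keyword match" point concrete, here is a minimal sketch of a bag-of-words baseline applied to the two example pairs quoted above. The function name `token_jaccard` and the choice of Jaccard overlap are illustrative assumptions, not something taken from the dataset or the competition; the sketch is only meant to show why shared keywords can produce a non-trivial similarity score for pairs the description calls otherwise unrelated.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (hypothetical baseline)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Two empty phrases share no tokens; treat as zero overlap.
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# Phrase pairs quoted in the dataset description as keyword-matching
# but otherwise unrelated.
adversarial_pairs = [
    ("container section", "kitchen container"),
    ("offset table", "table fan"),
]

for anchor, target in adversarial_pairs:
    # Each pair shares one of three distinct tokens, so the overlap is 1/3.
    print(f"{anchor!r} vs {target!r}: {token_jaccard(anchor, target):.2f}")
```

Both pairs score 0.33 on token overlap even though the description treats them as unrelated, which is the kind of error the human ratings are meant to expose. A baseline like this also ignores the CPC context entirely, so it cannot tell the animal "mouse" from the input device.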
