cverluise/PatCit

patCit

Building a comprehensive dataset of patent citations

👩‍🔬 Exploring the universe of patent citations has never been easier. No more complicated data setup, memory issues, or queries that run forever: we host patCit on BigQuery for you.

🤗 patCit is community driven and benefits from the support of a reactive team that is happy to help and tackle your next request. This is where academics and industry practitioners meet.

🔮 patCit is based on state-of-the-art open source projects and libraries such as grobid, biblio-glutton and spaCy. Even better, patCit is continuously improving with the rest of its ecosystem.

🎓 Want to know more? Read the patCit academic presentation or dive into the usage and technical guides on the patCit documentation website.

💌 To receive project updates by email or in your GitHub feed, join the patCit newsletter and star the repository on GitHub.

What will you find in patCit?

Patents are at the crossroads of many innovation nodes: science, open knowledge, products, competition, etc. At patCit, we are building a comprehensive dataset of patent citations to help the community explore this terra incognita. patCit offers:

  • 🌎 worldwide coverage
  • 📄 & 📚 front-page and in-text citations
  • 🌈 all sorts of documents, not just scientific articles

💡 How do we do it? We use recent progress in Natural Language Processing (NLP) to extract citations and structure them into actionable pieces of information.

Front-page

patCit builds on DOCDB, the largest database of Non Patent Literature (NPL) citations. First, we deduplicate this corpus and organize it into 10 categories. Then, we design and apply category-specific information extraction models using spaCy. Finally, when possible, we enrich the data using external, domain-specific, high-quality databases.

Status by category (columns: Classification, Information extraction, Enrichment, BigQuery table, Colab notebook):

  • Bibliographical reference: ✅ ✅ ✅ 🔜
  • Office action: ✅
  • Patent: ✅
  • Search report: ✅
  • Product documentation: ✅
  • Norm & standard: ✅ ✅ Open In Colab
  • Webpage: ✅
  • Database: ✅ ✅ 🔜
  • Litigation: ✅
  • Wiki: ✅ ✅ Open In Colab
  • All: ✅ NR ✅ Open In Colab
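
The first two front-page steps (deduplication, then classification into categories) can be sketched as follows. This is only an illustration under loud assumptions: the keyword rules, the `Unclassified` fallback and all function names are hypothetical, and patCit itself relies on trained spaCy models rather than keyword matching.

```python
import re

# Hypothetical keyword rules for three of the categories (illustration only;
# these are NOT the models used by patCit).
CATEGORY_RULES = [
    ("Wiki", re.compile(r"wikipedia", re.IGNORECASE)),
    ("Norm & standard", re.compile(r"\b(3GPP|IEEE\s+\d|ISO\s*\d|ETSI)", re.IGNORECASE)),
    ("Bibliographical reference", re.compile(r"\bvol\.|\bpp\.|\bet al\b", re.IGNORECASE)),
]


def normalize(citation: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical strings match."""
    return re.sub(r"[\W_]+", " ", citation.lower()).strip()


def deduplicate(citations):
    """Keep the first occurrence of each citation, comparing normalized forms."""
    unique = {}
    for citation in citations:
        unique.setdefault(normalize(citation), citation)
    return list(unique.values())


def classify(citation: str) -> str:
    """Return the first category whose rule matches, else 'Unclassified'."""
    for category, pattern in CATEGORY_RULES:
        if pattern.search(citation):
            return category
    return "Unclassified"


raw = [
    "Smith et al., Nature, vol. 12, pp. 1-10, 2001.",
    "SMITH ET AL., NATURE, VOL. 12, PP. 1-10, 2001",
    "3GPP TS 36.211 V8.9.0",
    "Wikipedia, 'Patent', retrieved 2015",
]
labelled = [(classify(c), c) for c in deduplicate(raw)]
```

Deduplicating before classifying means each distinct citation string is processed only once, which matters when the same reference is cited by many patents.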

In-text

patCit builds on the Google Patents corpus of USPTO full-text patents. First, we extract patent and bibliographical reference citations. Then, we parse the detected in-text citations into a series of category-dependent attributes using grobid. Patent citations are matched with a standard publication number using the Google Patents matching API, and bibliographical references are matched with a DOI using biblio-glutton. Finally, when possible, we enrich the data using external, domain-specific, high-quality databases.

Status by category (columns: Citation extraction, Information extraction, Enrichment, BigQuery table, Colab notebook):

  • Bibliographical reference: ✅ ✅ ✅ 🔜
  • Patents: ✅ ✅ ✅ 🔜
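
To give a feel for what in-text citation extraction involves, here is a minimal sketch that finds US patent citations in a passage of text. It is a deliberately simplified regular-expression stand-in, not the grobid-based extraction described above; the pattern and the function name are assumptions made for illustration, and real patent texts contain many more citation variants.

```python
import re

# Simplified pattern for citations such as "U.S. Pat. No. 5,123,456".
# Assumption: only this family of spellings is covered.
US_PATENT = re.compile(
    r"U\.?S\.?\s+Pat(?:ent|\.)?\s+Nos?\.?\s*(\d{1,2}(?:,\d{3}){2}|\d{7,8})"
)


def extract_us_patent_citations(text: str) -> list:
    """Return the cited US patent numbers found in `text`, as digit-only strings."""
    return [match.group(1).replace(",", "") for match in US_PATENT.finditer(text)]


sample = (
    "As described in U.S. Pat. No. 5,123,456 and US Patent No. 6789012, "
    "the device comprises a sensor."
)
numbers = extract_us_patent_citations(sample)
```

The extracted numbers would still need to be matched with a standard publication number, which patCit does with the Google Patents matching API.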

FAIR

📍 Find - The patCit dataset is available on BigQuery in an interactive environment. For those who have a smattering of SQL, this is the perfect place to explore the data. It can also be downloaded from Zenodo.

👨‍🎓 If you are new to Google BigQuery (GBQ) and want to learn the basics, you can take the GBQ Quickstart. It should not take more than 2 minutes and might help a lot!
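
If you prefer Python to the BigQuery console, a first look at a table could be sketched as below. This is a minimal sketch under assumptions: the table id and column names are placeholders (take the real ones from the patCit documentation), the helper names are hypothetical, and running the query requires the `google-cloud-bigquery` package (plus pandas) and Google Cloud credentials.

```python
from typing import Sequence


def build_preview_query(table: str, columns: Sequence[str] = ("*",), limit: int = 10) -> str:
    """Build a small preview query for a fully qualified `project.dataset.table` id."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"SELECT {', '.join(columns)} FROM `{table}` LIMIT {limit}"


def run_query(sql: str):
    """Run `sql` on BigQuery and return the result as a pandas DataFrame."""
    # Imported lazily so that building queries works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials
    return client.query(sql).to_dataframe()


# Placeholder table id: replace it with a real patCit table before running.
sql = build_preview_query("your-project.your_dataset.your_table", limit=5)
# df = run_query(sql)  # uncomment once credentials and a real table id are set
```

Note that `LIMIT` alone generally does not reduce the amount of data BigQuery scans, so naming only the columns you need keeps queries cheaper.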

📖 Access - We maintain detailed documentation on how to access the data once you have found it on BigQuery or Zenodo. See the usage notes on the patCit documentation website.

🔀 Interoperate - Interoperability is at the core of patCit's ambition. We take care to extract unique identifiers whenever possible, so that the data can be enriched with domain-specific, high-quality databases. These include the DOI, PMID and PMCID for bibliographical references, the Technical Doc Number for standards, the Accession Number for genetic databases, the publication number for PATSTAT and Claims, etc. See the specific tables for more details.
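
As an illustration of what extracting unique identifiers from a bibliographical reference can look like, here is a minimal sketch. The regular expressions reflect the common textual shapes of DOI, PMID and PMCID values and are assumptions made for this example, not the rules used by patCit; the function name is hypothetical.

```python
import re

# Common textual shapes of each identifier (assumptions for illustration).
PATTERNS = {
    "DOI": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE),
    "PMCID": re.compile(r"\bPMC\d+\b"),
    "PMID": re.compile(r"\bPMID:?\s*(\d+)\b", re.IGNORECASE),
}


def extract_identifiers(citation: str) -> dict:
    """Return the first DOI, PMID and PMCID found in `citation`, if any."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(citation)
        if match:
            # Use the capturing group when the pattern defines one, else the full match.
            value = match.group(match.lastindex or 0)
            found[name] = value.rstrip(".,;")  # drop trailing sentence punctuation
    return found


citation = (
    "Doe J. et al. (2010) J. Biol. 5(2):1-9. doi:10.1000/xyz123. "
    "PMID: 12345678; PMCID: PMC1234567."
)
identifiers = extract_identifiers(citation)
```

Once a citation carries such an identifier, it can be joined to any external database that is keyed on the same identifier.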

🔂 Reproduce - You are in the right place. This GitHub repository is the project factory. You can learn more about data recipes and models on the patCit documentation website.

Contributing

There are many ways to contribute to patCit, and many of them do not involve coding.

Give feedback - We want to make patCit truly useful to the community, so we are always happy to receive feedback.

Share your thoughts - We believe that discussions are much more valuable when they are shared publicly. This way, everyone can benefit from them. Hence, we strongly encourage you to share your issues and requests in the issue section of the patCit GitHub repository.

Feel like coding today? - We will be more than happy to receive any contributions from you and the community. We have already started to tag some issues with "good first issue" and "help wanted".

Team

This project was initiated by Gaétan de Rassenfosse (EPFL) and Cyril Verluise (Collège de France) in 2019.

Since then, it has benefited from the contributions of Gabriele Cristelli (EPFL), Francesco Gerotto (Sciences Po), Kyle Higham (Hitotsubashi University) and Lucas Violon (HEC Paris).

We are also thankful to Domenico Golzio for constant support and to @leflix311, @kermitt2, Tim Simcoe (Boston University), @SuperMayo and @wetherbeei for helpful comments.

Contribution details are available in CRediT.