2. Data pipeline

(these links are temporarily restricted to authors)

2.1. Get the data

The data used in this work has been collected from several sources, namely:

2.2. Feature construction

  • Age: We performed a non-linear least-squares regression between the variables “DNI” and age, using the original AAA table as the training dataset.

  • Gender: We use the table compiled by Mustafa Atik to assign gender on the basis of the names. We have also tried other tools, e.g. genderize.io (through the client https://github.com/SteelPangolin/genderize) and the GenderAPI web tool, with the same results.
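
The age regression in the first item above can be illustrated with a minimal sketch. The source states neither the functional form nor the fitting routine, so everything here is an assumption made for illustration: the exponential model age = a * exp(-b * x), with x a rescaled DNI number, the hand-written Gauss-Newton solver, and the names `predict_age` and `fit_exponential`.

```python
import math


def predict_age(x, a, b):
    """Assumed model (not taken from the source): age = a * exp(-b * x)."""
    return a * math.exp(-b * x)


def fit_exponential(xs, ys, a0, b0, n_iter=50, tol=1e-12):
    """Fit (a, b) by Gauss-Newton non-linear least squares.

    xs: rescaled DNI values; ys: known ages; (a0, b0): initial guess.
    """
    a, b = a0, b0
    for _ in range(n_iter):
        # Accumulate the 2x2 normal equations (J^T J) delta = J^T r.
        s_aa = s_ab = s_bb = g_a = g_b = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(-b * x)
            r = y - a * e        # residual of the current model
            ja = e               # d(model)/da
            jb = -a * x * e      # d(model)/db
            s_aa += ja * ja
            s_ab += ja * jb
            s_bb += jb * jb
            g_a += ja * r
            g_b += jb * r
        det = s_aa * s_bb - s_ab * s_ab
        if det == 0.0:
            raise ValueError("singular normal equations; try another initial guess")
        # Solve the 2x2 system with Cramer's rule and update the parameters.
        da = (g_a * s_bb - s_ab * g_b) / det
        db = (s_aa * g_b - s_ab * g_a) / det
        a, b = a + da, b + db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b
```

For a quick check, one can generate noise-free synthetic ages with `predict_age` and confirm that `fit_exponential` recovers the generating parameters from a nearby initial guess. With real data the model would have to be chosen from the training table itself, and a library routine such as SciPy's `curve_fit` would be the more usual choice.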

2.3. Publications

We have performed a detailed analysis of publication data for each author. The process involves the following steps:

  • Download ADS publication data for each author using the name as the search key. At this stage we use the Python packages ADS and PINNACLE.

  • Search and add ORCID keys

  • Train, evaluate and apply a machine learning model to clean the sample of papers. This is required since a search by name often returns entries for several authors with similar names.

  • Add metrics for journals, taking data from the SCImago Journal & Country Rank public portal

  • Add publication metrics
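
The name ambiguity that motivates the cleaning step above can be seen in a small sketch. It is not the machine learning model used in this work. It only shows, under assumed conventions, how a search key built from surname and first initial matches several distinct people. The record layout (`id`, `authors`), the sample names, and the helpers `name_key` and `candidates` are hypothetical.

```python
import unicodedata


def _normalise(text):
    """Lower-case, trim, and strip accents so 'García' equals 'Garcia'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).strip().lower()


def name_key(name):
    """Reduce 'Surname, Given names' to a (surname, first initial) key."""
    surname, _, given = name.partition(",")
    return _normalise(surname), _normalise(given)[:1]


def candidates(target, papers):
    """Return (paper id, matched author) pairs whose key equals the target's."""
    key = name_key(target)
    return [
        (paper["id"], author)
        for paper in papers
        for author in paper["authors"]
        if name_key(author) == key
    ]


# Made-up records: three different spellings share one search key.
papers = [
    {"id": "paper-1", "authors": ["García, Juan", "Lopez, M."]},
    {"id": "paper-2", "authors": ["Garcia, J."]},
    {"id": "paper-3", "authors": ["Garcia, Julia", "Perez, A."]},
    {"id": "paper-4", "authors": ["Gomez, J."]},
]
hits = candidates("García, Juan", papers)
```

Here `hits` contains entries from the first three records, although "Garcia, Julia" is clearly a different person and "Garcia, J." cannot be decided from the name alone. Cases like these are what a classifier, or an ORCID match where one is available, has to resolve.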

The curated dataset allows us to construct several indices:

  • number of authors per article

  • number of articles per author

  • number of articles per author, as leading author

  • distribution in time of articles

  • H-index

  • relative position of a given author in the authors list
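
The indices listed above can be computed from a cleaned list of papers along the following lines. This is a minimal sketch under assumptions: the record layout (`authors`, `year`, `citations`) and the function names are hypothetical, and the relative position is taken here as 0 for the first author and 1 for the last, a convention the source does not specify.

```python
from collections import Counter


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def relative_position(authors, author):
    """Assumed convention: 0.0 = first author, 1.0 = last author.

    Single-author papers are assigned 0.0.
    """
    if len(authors) == 1:
        return 0.0
    return authors.index(author) / (len(authors) - 1)


def author_indices(papers, author):
    """Collect the per-author indices from a list of paper records."""
    own = [p for p in papers if author in p["authors"]]
    return {
        "authors_per_article": [len(p["authors"]) for p in own],
        "n_articles": len(own),
        "n_lead_articles": sum(1 for p in own if p["authors"][0] == author),
        "articles_per_year": dict(Counter(p["year"] for p in own)),
        "h_index": h_index([p["citations"] for p in own]),
        "relative_positions": [relative_position(p["authors"], author) for p in own],
    }
```

Each record is expected to carry an ordered author list, so lead authorship and relative position follow directly from the list order. This is also why the name-cleaning step matters: a wrongly attributed paper changes every index at once.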