2. Data pipeline
(these links are temporarily restricted to authors)
Shared Drive with data warehouse (requires access rights)
Overleaf document for paper (requires access rights)
Data repository (coming soon)
2.1. Get the data
The data used in this work has been collected from several sources, namely:
CONICET, “gobierno abierto > conicet en cifras” https://cifras.conicet.gov.ar/publica/
CONICET, “conicet digital” Repositorio institucional https://ri.conicet.gov.ar
Astronomy Institutes in Argentina:
Universities in Argentina
2.2. Feature construction
Age: We estimated age from the “DNI” number with a non-linear least squares regression, using the original AAA table as the training dataset.
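As an illustration only, a fit of this kind could be sketched with scipy.optimize.curve_fit. The functional form (age_model, an exponential decay in DNI), the parameter names, the helper estimate_age and the synthetic training values are all assumptions made for this example; they are not the model or the data used in this work.

```python
import numpy as np
from scipy.optimize import curve_fit


def age_model(dni_millions, a, b, c):
    """Hypothetical model: age decreases smoothly as the DNI number grows."""
    return a * np.exp(-b * dni_millions) + c


# Synthetic training sample (NOT real data): DNI in millions, age in years.
rng = np.random.default_rng(0)
dni_train = np.linspace(5.0, 40.0, 50)
age_train = age_model(dni_train, 90.0, 0.04, 5.0) + rng.normal(0.0, 1.0, dni_train.size)

# Non-linear least squares fit; p0 is a rough initial guess for (a, b, c).
params, _ = curve_fit(age_model, dni_train, age_train, p0=(80.0, 0.05, 0.0))


def estimate_age(dni):
    """Estimate age in years from a DNI number using the fitted parameters."""
    return float(age_model(dni / 1e6, *params))
```

Once fitted, estimate_age(20_000_000) returns the predicted age for that DNI number. With a real training table, the synthetic arrays would be replaced by the DNI and age columns.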
Gender: We use the table compiled by Mustafa Atik to assign gender on the basis of first names. We also tried other tools, e.g. genderize.io (through the client https://github.com/SteelPangolin/genderize) and the GenderAPI web tool, and obtained the same results.
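A minimal sketch of a name-based lookup, assuming a table that maps first names to a gender label. The NAME_TABLE entries, the label codes and the helpers normalize and assign_gender are hypothetical and only stand in for the actual table; names absent from the table are left unassigned.

```python
import unicodedata

# Hypothetical excerpt of a name-to-gender table (illustrative entries only).
NAME_TABLE = {"maria": "F", "ana": "F", "juan": "M", "carlos": "M"}


def normalize(name):
    """Lowercase and strip accents so that 'María' matches 'maria'."""
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def assign_gender(full_name):
    """Return the label of the first name token found in the table, else None."""
    for token in full_name.split():
        label = NAME_TABLE.get(normalize(token))
        if label is not None:
            return label
    return None
```

For example, assign_gender("María Elena Pérez") returns "F", while a name with no token in the table returns None and can be handled separately.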
2.3. Publications
We have performed a detailed analysis of publication data for each author. The process involves the following steps:
Download ADS publication data for each author, using the author name as the search key. At this stage we use the Python packages ADS and PINNACLE.
Search and add ORCID keys
Train, evaluate and apply a machine learning model to clean the sample of papers. This is required because a search by name often returns entries for several authors with similar names.
Add metrics for journals, using data from the SCImago Journal & Country Rank public portal
Add publication metrics
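As a rough sketch of the journal-metrics step only, the following shows one way to attach per-journal values to paper records by matching normalized journal titles. The record field "pub", the metric names "sjr" and "quartile", the JOURNAL_METRICS excerpt (whose values are made up) and both helper functions are assumptions for illustration; the actual matching used in this work may differ.

```python
# Hypothetical excerpt of a journal-metrics table (values are made up).
JOURNAL_METRICS = {
    "astronomy and astrophysics": {"sjr": 1.0, "quartile": "Q1"},
    "astrophysical journal": {"sjr": 2.0, "quartile": "Q1"},
}


def journal_key(title):
    """Normalize a journal title: lowercase, '&' -> 'and', drop a leading 'the'."""
    words = title.lower().replace("&", " and ").split()
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)


def add_journal_metrics(papers):
    """Return copies of the paper records with journal metrics attached.

    Papers whose journal is not in the table get None for each metric.
    """
    enriched = []
    for paper in papers:
        metrics = JOURNAL_METRICS.get(journal_key(paper["pub"]), {})
        enriched.append(
            {**paper, "sjr": metrics.get("sjr"), "quartile": metrics.get("quartile")}
        )
    return enriched
```

Matching on a normalized title keeps variants such as "Astronomy & Astrophysics" and "Astronomy and Astrophysics" together; unmatched journals are kept with empty metrics rather than dropped.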
The curated dataset makes it possible to construct several indices:
number of authors per article
number of articles per author
number of articles per author, as leading author
distribution in time of articles
H-index
relative position of a given author in the authors list
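The indices listed above can be sketched as follows, assuming each paper is a record with an ordered author list, a publication year and a citation count. The field names ("authors", "year", "citations") and the function author_indices are hypothetical. The H-index follows its standard definition; the relative position is scaled here so that 0 is the first author and 1 the last, which is one possible convention among several.

```python
from collections import Counter


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def author_indices(papers, author):
    """Compute simple per-author indices from a list of paper records."""
    own = [p for p in papers if author in p["authors"]]
    return {
        "n_articles": len(own),
        "n_lead": sum(1 for p in own if p["authors"][0] == author),
        "authors_per_article": [len(p["authors"]) for p in own],
        "articles_per_year": dict(Counter(p["year"] for p in own)),
        "h_index": h_index([p["citations"] for p in own]),
        # Position in the author list scaled to [0, 1]: 0 = first, 1 = last.
        "relative_position": [
            p["authors"].index(author) / max(len(p["authors"]) - 1, 1) for p in own
        ],
    }
```

For an author with three papers cited 10, 3 and 1 times, h_index returns 2, since two papers have at least two citations each but not three papers with at least three.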