2. Data pipeline

(these links are temporarily restricted to authors)

Shared Drive with data warehouse (requires access rights)

GitHub repository for the code

Overleaf document for paper (requires access rights)

Data repository (coming soon)

2.1. Get the data

The data used in this work were collected from several sources:

CONICET, “gobierno abierto > conicet en cifras” https://cifras.conicet.gov.ar/publica/

CONICET, “conicet digital” Repositorio institucional https://ri.conicet.gov.ar

Asociación Argentina de Astronomía (AAA)
Astronomy Institutes in Argentina:
- IATE
- IAFE
- ICATE
- IALP
- OAC
Universities in Argentina:
- UNC
  - Lic. en Astronomía
  - Statistical data

2.2. Feature construction

Age: We estimated age from the national identity document number (“DNI”) by fitting a non-linear least-squares regression of age on DNI, using the original AAA table as the training dataset.
Gender: We use the table compiled by Mustafa Atik to assign gender on the basis of first names. We also tried other tools, e.g. genderize.io (through the client https://github.com/SteelPangolin/genderize) and the GenderAPI web tool, with the same results.
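
The text does not state the functional form or the optimizer used for the age regression. The following is only a minimal sketch of the idea, assuming a hypothetical power-law model age = a + b * (DNI / 1e6) ** p. The exponent p is scanned over a grid and, for each candidate, a and b are solved in closed form; this simple scheme stands in for a general non-linear least-squares solver. The training pairs are synthetic toy values, not entries from the AAA table, and all function names are illustrative.

```python
# Toy, synthetic (DNI, age) pairs generated from age = 90 - 10 * sqrt(DNI / 1e6).
# These are NOT values from the AAA table; they only exercise the fit.
train_dni = [4e6, 9e6, 16e6, 25e6, 36e6]
train_age = [70.0, 60.0, 50.0, 40.0, 30.0]


def fit_line(z, y):
    """Closed-form ordinary least squares for y = a + b * z."""
    n = len(z)
    mz = sum(z) / n
    my = sum(y) / n
    sxy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    sxx = sum((zi - mz) ** 2 for zi in z)
    b = sxy / sxx
    a = my - b * mz
    return a, b


def fit_age_model(dni, age, exponents):
    """Fit age = a + b * (dni / 1e6) ** p (assumed model).

    For each candidate exponent p the model is linear in a and b,
    so they are solved exactly; the p with the smallest sum of
    squared residuals is kept.
    """
    best = None
    for p in exponents:
        z = [(d / 1e6) ** p for d in dni]
        a, b = fit_line(z, age)
        sse = sum((a + b * zi - yi) ** 2 for zi, yi in zip(z, age))
        if best is None or sse < best[0]:
            best = (sse, a, b, p)
    _, a, b, p = best
    return a, b, p


def predict_age(dni, a, b, p):
    """Evaluate the fitted (assumed) model for one DNI value."""
    return a + b * (dni / 1e6) ** p


exponents = [k / 10 for k in range(1, 21)]  # candidate p values 0.1 ... 2.0
a, b, p = fit_age_model(train_dni, train_age, exponents)
```

Once fitted, predict_age(dni, a, b, p) can be applied to any record whose age is missing but whose DNI is known. With real data, a held-out subset of the training table would be needed to judge how far such an estimate can be trusted.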

2.3. Publications

We have performed a detailed analysis of publication data for each author. The process involves the following steps:

Download ADS publication data for each author, using the author's name as the search key. At this stage we use the Python packages ADS and PINNACLE.
Search and add ORCID keys
Train, evaluate and apply a machine learning model to clean the sample of papers. This is required since a search by name often returns entries for several authors with similar names.
Add journal metrics, taking data from the SCImago Journal & Country Rank public portal
Add publication metrics
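
To illustrate why the cleaning step is needed, the sketch below shows how a search by name can match several distinct people. It is not the machine learning model used in this work; it is only a simple compatibility rule (same surname, compatible initials), and the function names and author names are hypothetical.

```python
def normalize(name):
    """Split 'Surname, Given Names' into (surname, initials), lowercased.

    Accents and compound surnames are not handled in this sketch.
    """
    surname, _, given = name.partition(",")
    initials = "".join(part[0] for part in given.replace(".", " ").split())
    return surname.strip().lower(), initials.lower()


def compatible(query, candidate):
    """True if surnames agree and one set of initials is a prefix of the other."""
    q_surname, q_initials = normalize(query)
    c_surname, c_initials = normalize(candidate)
    same_surname = q_surname == c_surname
    initials_ok = q_initials.startswith(c_initials) or c_initials.startswith(q_initials)
    return same_surname and initials_ok


# Hypothetical author strings as they might appear in search results.
returned = ["Perez, Juan Carlos", "Perez, Julia", "Perez, M.", "Perez, J. A."]
matches = [name for name in returned if compatible("Perez, J.", name)]
```

Here the query "Perez, J." is compatible with three different author strings, so name matching alone cannot decide which papers belong to the target author. Additional evidence such as ORCID keys is needed, which is the role of the classifier in the steps above.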

The curated dataset allows us to construct several indices:

number of authors per article
number of articles per author
number of articles per author, as leading author
distribution of articles over time
H-index
relative position of a given author in the authors list
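
The indices listed above can be sketched on a toy table of papers. The record layout (ordered author list, year, citation count), the function names, and the definition of relative position (0.0 for first author, 1.0 for last author) are assumptions made for illustration; the data are invented and do not come from the curated dataset.

```python
from collections import Counter

# Hypothetical toy records: ordered author list, publication year, citation count.
papers = [
    {"authors": ["A", "B", "C"], "year": 2015, "citations": 10},
    {"authors": ["B", "A"], "year": 2015, "citations": 3},
    {"authors": ["A"], "year": 2018, "citations": 1},
    {"authors": ["C", "B", "D", "A", "E"], "year": 2020, "citations": 4},
]


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def author_indices(papers, author):
    """Collect per-author indices from a list of paper records."""
    own = [p for p in papers if author in p["authors"]]
    positions = []
    for p in own:
        n = len(p["authors"])
        idx = p["authors"].index(author)
        # Assumed convention: 0.0 = first author, 1.0 = last author;
        # single-author papers are assigned 0.0.
        positions.append(idx / (n - 1) if n > 1 else 0.0)
    return {
        "n_articles": len(own),
        "n_first_author": sum(1 for p in own if p["authors"][0] == author),
        "h_index": h_index([p["citations"] for p in own]),
        "articles_per_year": dict(Counter(p["year"] for p in own)),
        "relative_positions": positions,
    }


authors_per_article = [len(p["authors"]) for p in papers]
indices_a = author_indices(papers, "A")
```

Normalizing the position by the length of the author list makes papers with very different numbers of authors comparable, at the cost of treating the second of two authors the same as the last of twenty.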