4. Dataset

GENDER BALANCE IN THE ARGENTINA ASTRONOMY WORKFORCE

This dataset is published in http://dryad/datasets/astrogen, see that link for full access to the data.

4.1. Metadata

Dataset compiled from several oficial and public sources about the career development for astronomers in Argentina.

  • Marcelo Lares [1, 2, 3] ORCID:

  • Valeria Coenda [1, 2, 3] ORCID:

  • Luciana Gramajo [1, 2, 3] ORCID:

  • Héctor Julián Martínez-Atencio [1, 2, 3] ORCID:

  • Celeste Parisi [1, 2, 3] ORCID:

  • Cinthia Ragone [1, 3] ORCID:

Affiliations:

    1. Instituto de Astronomía Teórica y Experimental (IATE)

    1. Observatorio Astronómico de Córdoba (OAC)

    1. CONICET

Contact: Marcelo Lares

Date of data collection: Nov 29, 2021

GEOGRAPHIC LOCATION: Argentina

KEYWORDS: gender balance, astronomy

LANGUAGE: English

Funding sources: The authors acknowledge founding from CONICET and SECYT, although to granted for this project especifically.

4.2. Data and file overview

We provide a single file containing an SQL database with five tables:

table

#elements

#columns

famaf

210

6

people

838

19

papers

341825

9

  • famaf

Count of male and female students, by year and year of enrollment.

Column

Name

format

description

1

year

INT

year the count of students is made

2

year_in

INT

year of enrollment

3

mi

INT

number of male active students in year “year” that enrolled in year “year_in”.

4

me

INT

number of male students that obtain the degree in year “year” and enrolled in year “year_in”.

5

fi

INT

number of female active students in year “year” that enrolled in year “year_in”.

6

fe

INT

number of female students that obtain the degree in year “year” and enrolled in year “year_in”.

  • people

List of astronomers in Argentina

Column

Name

format

contents

1

Author ID

INT

Unique identifier

2

age

INT

age [years]

3

gender

CHAR

gender (m, f)

4

Hindex

INT

H-index for publication set

5

Npapers

INT

number of papers

6

cc07

INT

category in CONICET in 2007

7

cc08

INT

category in CONICET in 2008

8

cc09

INT

category in CONICET in 2009

9

cc10

INT

category in CONICET in 2010

10

cc11

INT

category in CONICET in 2011

11

cc12

INT

category in CONICET in 2012

12

cc13

INT

category in CONICET in 2013

13

cc14

INT

category in CONICET in 2014

14

cc15

INT

category in CONICET in 2015

15

cc16

INT

category in CONICET in 2016

16

cc17

INT

category in CONICET in 2017

17

cc18

INT

category in CONICET in 2018

18

cc19

INT

category in CONICET in 2019

19

cc20

INT

category in CONICET in 2020

  • papers

List of papers

Column

Name

format

contents

1

ID

INT

Author identifier

2

journal

INT

journal name

3

journal_Q

INT

Q index for journal (from SCIMAGO). 0: not indexed, 1: first quartile, 2: second quartile, 3: third quartile, 4: fourth quartile

4

year

INT

year of the publication

5

citation_count

INT

number of citations (at the date of compilation)

6

author_count

INT

number of authors

7

author_pos

INT

position of author in author list

8

inar

INT

identifier of author affiliation. 0: not in Argentina, 1: in Argentina, 2: not declared.

9

filter

INT

automatic filter. 0: do not belong to the author, 1: assigned to the author

The papers have been classified by an automatic agent as belonging to the author. The full set of publications retrieved from the ADS service is classified according to this classifier, which gives the “filter” column as a result.

The fields ID allows to relate the tables “people” and “papers”.

4.3. Sample selections

We use a subset from the “people” table, corresponding to authors tha satisfy the following criteria:

  • Active on 2021 (last published paper in a Q1 journal and not from a large collaboration not before 2016)

  • Age in the range 25 to 85 years old

  • At least 75% of the Q1 papers (excluding large collaborations) published with an affiliation in Argentina

This dataset can be obtained from the database using, for example, the following SQL query:

select *,
       COUNT(*) as cc,
       MAX(p.year) as ymx,
       SUM(CASE WHEN p.inar=1 then 1 else 0 END) as N_inar,
       SUM(CASE WHEN p.inar=1 then 1 else 0 END) / (1.*COUNT(*)) as q
FROM papers as p
    INNER JOIN people as g
WHERE
        p.ID==g.ID
        AND
        g.age BETWEEN 25 AND 85
        AND
        p.journal_Q==1
        AND
        p.author_count<51
GROUP BY p.ID
HAVING
   ymx>2016
   AND
   q>0.75

Another subset is the one that comprise all researchers in CONICET at a given year.

The following query returns a subset from the “people” table corrresponding to active researchers at CONICET in 2020:

select * from people
where cc20 is not NULL

The list of publications from a given author can be obtained using the ID fields. For example, all the publications in top journals from the author with ID=35 can be obtained as:

select * from papers
where
  ID==35
  and
  journal_Q==1

SQL queries can be easily run, either using apropriate software (e.g. DB browser for sqlite) or using sqlite3 in python.

Let be the astrogen object a string containing the root directory of the project. The following code allows to read all researchers that where category I in 2015 and category II in 2020.

from os import path
from sqlite3 import connect

db = path.join(astrogen, 'data/redux/astrogen_DB_anonymous.db')
conn = connect(db)
c = conn.cursor()
query = ('''
         select *
         from people
         where
           cc17==1
           AND
           cc20==2
         ''')
c.execute(query)
df = pd.DataFrame(c.fetchall())
conn.close()

4.4. Validation of the publication lists

Once the list of publications retrieved from the Astronomical Data Service (ADS) has been classified using a Support Vector Machine model, we prepare pages for each author in order to visually verify possible sources or error.

A sample page can be found here. In these pages we include a link to the ADS entry on the author, using the same search string that was used when retrieving information from the ADS server using the ADS python package.

We also include a link to the ADS page of the author requesting only papers with at most 50 authors and excluding the BAAA publication, which is a proceeding from the AAA annual meetings that authors use to have many entries.

The check marks are the automatic selection made by the classifier, and these pages allow to correct false positives or false negatives by changing the tickmarks and saving a file with the updated filter.

logo

Example of the saving button:

logo

4.5. References

FILE FORMATS. Cornell Research Data Management Service Group. http://data.research.cornell.edu/content/file-formats

FILE MANAGEMENT. Cornell Research Data Management Service Group. http://data.research.cornell.edu/content/file-management