COSYNE reviewer feedback
tl;dr : Crowd-sourcing raw scores for your COSYNE reviewer feedback.
Following that message:
Dear community,
COSYNE is a great conference which plays a pivotal role in our field. If you have submitted an abstract (or several), you have recently received your scores. I am not affiliated with COSYNE - yet I am willing to contribute in some way: I would like to ask for one minute of your time to report the raw scores from your reviewers. I will summarize the results in a few lines in one week's time (11/02). The more numerous your responses, the more precise the summary!
Thanks!
As of 2022-02-20, I had received $N = 98$ answers to the Google form ($95$ of which are valid) out of the $881$ submitted abstracts. In short, the result is that the total score $S$ is simply the average of the scores $s_i$ given by each reviewer $i$, weighted by the confidence levels $\pi_i$ (as stated in the email we received from the chairs):
$$ S = \frac{ \sum_i \pi_i\cdot s_i}{\sum_i \pi_i} $$ or, if you prefer, $$ S = \sum_i \frac{\pi_i}{\sum_j \pi_j} \cdot s_i $$
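For instance, with hypothetical scores $(s_1, s_2, s_3) = (7, 5, 8)$ and confidences $(\pi_1, \pi_2, \pi_3) = (4, 2, 3)$ (numbers chosen purely for illustration), this gives $$ S = \frac{4 \cdot 7 + 2 \cdot 5 + 3 \cdot 8}{4 + 2 + 3} = \frac{62}{9} \approx 6.89 $$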
We deduce from that formula that the threshold is close to $6.34$ this year:
More details in the notebook (or directly in this post) which can also be forked here and interactively modified on binder.
EDIT: On 2022-02-20, I updated the notebook to account for new answers; I have now received $N = 98$ answers ($95$ of which are valid), yet nothing changed qualitatively. On 2022-02-11, I had received $N = 82$ answers from the Google form ($79$ of which were valid) and the estimated threshold was close to $6.05$.
Other resources I found:
- https://neuroecology.wordpress.com/2020/02/27/cosyne2020-by-the-numbers/
- https://charlesfrye.github.io/stats/2019/03/06/cosyne19-gender-bias.html : on gender balance
- https://twitter.com/jmourabarbosa/status/1488432239692107778 : on the correlation between the scores of reviewers.
importing data
The data was collected using a Google form which, thanks to a public link, can be loaded directly into pandas:
import numpy as np
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/1F2ptf6mlwvV5jaAv6iQusFbDElTIjrXi-HhURmz8E_0/export?format=csv'
score_sheet = pd.read_csv(url)
score_sheet.tail()
First, thanks to the people who left comments:
score_sheet[score_sheet['Comments?'].notna()]['Comments?']
These are always useful, but let's leave them aside for the quantitative analysis:
score_sheet = score_sheet.drop(['Comments?'], axis=1)
score_sheet.tail()
A quick sanity check for missing data:
for i in [1, 2, 3]:
    print(score_sheet[score_sheet[f'Reviewer #{i} score'].isna()])
    print(score_sheet[score_sheet[f'Reviewer #{i} confidence'].isna()])
It seems we should remove rows 0, 9 and 52 to get a cleaner score sheet and avoid overkill hacks.
score_sheet = score_sheet.drop([0, 9, 52], axis=0)
score_sheet.head()
Finally, the scores are integers and should be converted from the float format imported from Google Forms.
for i in [1, 2, 3]:
    score_sheet[f'Reviewer #{i} score'] = score_sheet[f'Reviewer #{i} score'].astype(int)
    score_sheet[f'Reviewer #{i} confidence'] = score_sheet[f'Reviewer #{i} confidence'].astype(int)
score_sheet.head()
more data
The message received from cosyne@confmaster.net mentions some numbers:
reviewer_pool = 215
total_reviews = 2639 # out of an expected 2643 (presumably 3 reviews x 881 abstracts); 4 reviews seem to be missing
submitted_abstracts = 881
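As a quick sanity check (my own back-of-the-envelope addition, not something stated in the message), these numbers are consistent with each abstract receiving three reviews, with each reviewer handling roughly a dozen:
print(f"Expected reviews (3 per abstract) = {3 * submitted_abstracts}")
print(f"Average reviews per reviewer = {total_reviews / reviewer_pool:.1f}")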
analyzing raw scores
Now that we have all the data in hand, let's do a quick analysis.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(13, 5))
cols = ['Reviewer #1 score', 'Reviewer #2 score', 'Reviewer #3 score']
ax = score_sheet[cols].plot.hist(stacked=True, ax=ax, bins=9)
ax.set_xlabel('Reviewer score (from 1 to 10)')
ax.set_xlim(1, 10)
ax.set_xticks(np.arange(1, 10)+.5)
ax.set_xticklabels(np.arange(1, 10))
ax.set_ylabel('#');
fig, ax = plt.subplots(figsize=(13, 5))
cols = ['Reviewer #1 confidence', 'Reviewer #2 confidence', 'Reviewer #3 confidence']
ax = score_sheet[cols].plot.hist(stacked=True, ax=ax, bins=5)
ax.set_xlabel('Reviewer confidence (from 1 to 5)')
ax.set_xlim(1, 5)
ax.set_xticks(np.arange(1, 6)*5/6+.5)
ax.set_xticklabels(np.arange(1, 6))
ax.set_ylabel('#');
accepted = score_sheet[score_sheet['Abstract accepted?']=='Yes']
print(f"Number of accepted abstracts = {len(accepted)}")
print(f"Percent of accepted abstracts in survey = {len(accepted)/len(score_sheet)*100:.1f}%")
retrieving the razor score
The message mentions the method:
Each review comprised a short comment and a score between 1 and 10. Individual scores were weighed by a confidence factor and averaged for each submission.
This is my attempt at deriving a score:
p = 1 # trying out different norms... confidence=variance? confidence=std?
total_score = score_sheet['Reviewer #1 score'] * score_sheet['Reviewer #1 confidence']**p
total_score += score_sheet['Reviewer #2 score'] * score_sheet['Reviewer #2 confidence']**p
total_score += score_sheet['Reviewer #3 score'] * score_sheet['Reviewer #3 confidence']**p
total_weight = score_sheet['Reviewer #1 confidence']**p
total_weight += score_sheet['Reviewer #2 confidence']**p
total_weight += score_sheet['Reviewer #3 confidence']**p
score = total_score / total_weight
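As a side check (my own addition, not part of the official procedure), one can probe whether the choice of the exponent p matters much, for instance by comparing the mean weighted score of accepted versus rejected abstracts for a few hypothetical values of p:
accepted_mask = (score_sheet['Abstract accepted?'] == 'Yes')
for p_try in [0.5, 1, 2]:
    # hypothetical variants of the weighting; p=1 is the one used above
    w = sum(score_sheet[f'Reviewer #{i} confidence']**p_try for i in [1, 2, 3])
    s = sum(score_sheet[f'Reviewer #{i} score'] * score_sheet[f'Reviewer #{i} confidence']**p_try for i in [1, 2, 3]) / w
    gap = s[accepted_mask].mean() - s[~accepted_mask].mean()
    print(f"p={p_try}: mean accepted - mean rejected = {gap:.3f}")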
The histogram of the total score computed above looks like:
fig, ax = plt.subplots(figsize=(13, 5))
ax = score.hist(bins=np.arange(1, 11), ax=ax)
ax.set_xlabel('Total score (from 1 to 10)')
ax.set_ylabel('#');
The distribution of the sum of confidences looks like:
fig, ax = plt.subplots(figsize=(13, 5))
ax = total_weight.hist(bins=np.arange(1, 16), ax=ax)
ax.set_xlabel('Sum of confidences (from 3 to 15)')
ax.set_ylabel('#');
Let's scatter plot the outcome as a function of the score:
fig, ax = plt.subplots(figsize=(13, 5))
ax.scatter(score, score_sheet['Abstract accepted?']=='Yes', marker='+', alpha=.4, s=100)
ax.set_xlabel('Total score (from 1 to 10)')
ax.set_yticks(np.arange(0, 2))
ax.set_yticklabels(['No', 'Yes'])
ax.set_ylabel('Accepted?');
An (overkill) method would be to fit a sigmoid, as sketched below; we will then rather look directly at the threshold.
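A minimal sketch of such a fit (my own illustration, assuming scikit-learn is installed; this is not the method used by the organizers):
from sklearn.linear_model import LogisticRegression

X = score.values.reshape(-1, 1)  # weighted score as the single feature
y = (score_sheet['Abstract accepted?'] == 'Yes').values
clf = LogisticRegression().fit(X, y)
# the fitted sigmoid crosses probability 0.5 where intercept + coef * x = 0
implied_threshold = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"Sigmoid-implied acceptance threshold = {implied_threshold:.3f}")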
It was mentioned in the message that:
After considering additional constraining factors, the top scoring 54 % of submissions were accepted.
score_quantile = .54
print(f"Official percent of accepted abstracts = {score_quantile*100:.1f}%")
threshold = score.quantile(score_quantile)
print(f"threshold score for an accepted abstract {threshold:.3f}")
false_negatives = score[(score > threshold) & (score_sheet['Abstract accepted?']=='No')]
print(f"Abstracts rejected above the threshold = {len(false_negatives)}")
false_positives = score[(score < threshold) & (score_sheet['Abstract accepted?']=='Yes')]
print(f"Abstracts accepted below the threshold = {len(false_positives)}")
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(score, score_sheet['Abstract accepted?']=='Yes', marker='+', alpha=.4, s=100)
ax.scatter(false_positives, np.ones_like(false_positives), marker='+', alpha=.4, s=100, c='r')
ax.scatter(false_negatives, np.zeros_like(false_negatives), marker='+', alpha=.4, s=100, c='r')
ax.vlines([threshold], ymin=.0, ymax=1., colors='r', linestyles='--')
ax.set_xlabel('Total score (from 1 to 10)')
ax.set_yticks(np.arange(0, 2))
ax.set_yticklabels(['No', 'Yes'])
ax.text(threshold + 1, .5, f'{threshold=:.3f}')
ax.set_ylabel('Accepted?')
plt.tight_layout();
fig.savefig('2022-02-11_COSYNE-razor.png')
These discrepancies (abstracts accepted below the threshold or rejected above it) may well be due to errors in reporting the scores in the Google form, or to the "additional constraining factors" mentioned in the message:
After considering additional constraining factors, the top scoring 54 % of submissions were accepted.
gray zone
It seems there is a "gray zone" of abstracts whose scores lie between the minimal score of accepted abstracts and the maximal score of rejected ones:
score_min = score[score_sheet['Abstract accepted?']=='Yes'].min()
print(f"Minimal score for an accepted abstract {score_min:.3f}")
score_max = score[score_sheet['Abstract accepted?']=='No'].max()
print(f"Maximal score for a rejected abstract {score_max:.3f}")
gray_zone = score[(score_min < score) & ( score < score_max)]
print(f"Abstracts in gray zone {len(gray_zone)}")
print(f"Percent abstracts in gray zone = {len(gray_zone)/len(score)*100:.1f}%")
print(f"Predicted total abstracts in gray zone = {int(len(gray_zone)/len(score)*submitted_abstracts)}")
Out of the total of $881$ abstracts, it is certainly worth paying more attention to these roughly $100$ abstracts which lie closest to the threshold. Considering that these are likely the ones less likely to go to such a conference (students, minorities, lower-ranked universities), it is important to give proper consideration to their scientific value.
bonus: reliability of score for individual abstracts across reviewers
Similarly to that tweet by Srdjan Ostojic, I also looked at the dependence between reviewers' scores, but this is still in progress. Any help is welcome (you can fork the notebook).
import seaborn as sns
opts = dict(kind="reg", scatter_kws=dict(alpha=.3, s=100))
sns.jointplot(data=score_sheet, x='Reviewer #1 score', y='Reviewer #2 score', **opts)
sns.jointplot(data=score_sheet, x='Reviewer #1 score', y='Reviewer #3 score', **opts);
sns.jointplot(data=score_sheet, x='Reviewer #3 score', y='Reviewer #2 score', **opts);
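To put a number on these plots (my own addition; I am assuming here that Pearson and Spearman correlations are meaningful despite the discrete 1-10 scale), one can compute the pairwise correlations between reviewers' scores:
from scipy import stats

cols = ['Reviewer #1 score', 'Reviewer #2 score', 'Reviewer #3 score']
for i, col_i in enumerate(cols):
    for col_j in cols[i+1:]:
        r, p_r = stats.pearsonr(score_sheet[col_i], score_sheet[col_j])
        rho, p_rho = stats.spearmanr(score_sheet[col_i], score_sheet[col_j])
        print(f"{col_i} vs {col_j}: Pearson r={r:.2f} (p={p_r:.3f}), Spearman rho={rho:.2f} (p={p_rho:.3f})")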
dependence of score and confidence
sns.jointplot(data=score_sheet, x='Reviewer #1 score', y='Reviewer #1 confidence', **opts);
sns.jointplot(data=score_sheet, x='Reviewer #2 score', y='Reviewer #2 confidence', **opts);
sns.jointplot(data=score_sheet, x='Reviewer #3 score', y='Reviewer #3 confidence', **opts);
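As a rough numerical counterpart to these plots (again my own addition), the rank correlation between each reviewer's score and their own confidence:
from scipy import stats

for i in [1, 2, 3]:
    rho, p_val = stats.spearmanr(score_sheet[f'Reviewer #{i} score'], score_sheet[f'Reviewer #{i} confidence'])
    print(f"Reviewer #{i}: Spearman rho = {rho:.2f} (p = {p_val:.3f})")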