Ground Truth Labels and In-context Learning

With the introduction of pre-trained models and their adaptation to downstream tasks (Howard & Ruder, 2018; Devlin et al., 2019), the natural language processing (NLP) field has had its ImageNet moment. Moreover, in-context learning (ICL) (Brown et al., 2020) has proven to be another game-changer on top of pre-trained language models. However, not much work has gone into exploring why ICL works and which elements are crucial to its performance. In this blog post, I go over some literature exploring why, how, and when ICL works (Min et al., 2022; Yoo et al., 2022; Wei et al., 2023).


A few years ago, taking advantage of pre-trained language models became the standard way to approach a new task; by either adapting knowledge or fine-tuning the pre-trained model (Howard & Ruder, 2018; Peters et al., 2018), you could achieve state-of-the-art performance. Thus began the battle of pre-trained language models and their ever-growing sizes.

In 2020, Brown et al. (2020) introduced what they call in-context learning (ICL) as a capability of sufficiently large models. The authors propose GPT-3, a model trained on a causal language modeling objective on massive amounts of data. The resulting model exhibits ICL: the model receives a prompt with instructions for the task to be carried out, and examples of the task done successfully can also be included. The performance of ICL is surprisingly good and, best of all, no parameter updating happens.

The Premise

We will first look at in-context learning as presented originally (Brown et al., 2020). Figure 1 shows a sample input illustrating how in-context learning is usually done: generally a task description is included, examples may follow, and finally comes a prompt with the input for which a prediction is desired.

ICL input example
Fig. 1: In-context learning as presented originally (Brown et al., 2020).

The original paper calls this specific input an example of few-shot in-context learning. From this example we can infer that we can study at least the following aspects of ICL:

  1. Input-label mapping
  2. Distribution of inputs
  3. Label space
  4. Format of input

This blog post will discuss all of them, but will focus on the first one the most.

How does it work?

The overall mechanism of ICL is not well understood and many questions regarding why and how it works remain to be answered.

That said, recent studies (von Oswald et al., 2022; Dai et al., 2022) have shown that internally, the model mimics a gradient descent approach via its attention mechanism to learn from the input prompt. This conditions the model to “learn” from the input and thus generate the right output.
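To give a flavor of this correspondence (an informal sketch, not the papers' exact construction): for a simple linear model \(y = Wx\) trained with squared loss on the in-context examples \((x_i, y_i)\), one gradient-descent step from initial weights \(W_0\) with learning rate \(\eta\) yields the test prediction

\[\hat{y}_{\text{test}} = \Big(W_0 - \eta \sum_{i=1}^{k} (W_0 x_i - y_i)\, x_i^{\top}\Big)\, x_{\text{test}}\]

The cited works show that a (linear) self-attention layer can compute precisely this kind of update directly from the prompt, which is why no actual parameter change is needed.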

However, understanding how exactly the input influences the ICL mechanism and what truly matters is something that will be explored further in this blogpost.

What Matters: The Claim and the Semi-Refutal

It has historically been assumed that a good set of examples is important for good ICL performance (Liu et al., 2022; Lester et al., 2021). However, what makes a good example remained to be explored. A recent study (Min et al., 2022) proposed which elements of the prompt actually matter.

The Claim

In their paper, Min et al. (2022) explored six large language models (LLMs), ranging from 774 million parameters (GPT-2) to 175 billion parameters (GPT-3; see Table 1 for a full list of models evaluated). Two inference methods were tried: direct and channel. Six different tasks were evaluated: sentiment analysis, paraphrase detection, sentence completion, NLI, hate speech detection, and question answering, across 26 datasets in total.

| Model | # Params | Public | Meta-trained |
|---|---|---|---|
| GPT-2 Large | 774M | ✔️ | |
| MetaICL | 774M | ✔️ | ✔️ |
| GPT-J | 6B | ✔️ | |
| fairseq 6.7B | 6.7B | ✔️ | |
| fairseq 13B | 13B | ✔️ | |
| GPT-3 | 175B | | |

Table 1: Models analyzed (Min et al., 2022)

The authors ran experiments in a \(k\)-shot setting, where they set \(k=16\) and ran each experiment with five different seeds. They used \(F_1\) and accuracy as their reported metrics. Their findings were as follows:

Input-label Mapping: the main result of the paper shows that assigning random labels (as opposed to the ground-truth labels) to the examples in the input has no significant impact on ICL performance.

ICL input-label mapping experiment results.
Fig. 2: ICL performance not overly hurt when random labels are used (Min et al., 2022).
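To make the setup concrete, here is a minimal sketch (hypothetical function and field names, not the authors' actual code) of how one might build a few-shot sentiment prompt with gold labels swapped for random ones:

```python
import random

def build_prompt(examples, label_space, query, use_random_labels=True, seed=0):
    """Build a few-shot prompt; optionally replace gold labels with random ones."""
    rng = random.Random(seed)
    lines = []
    for text, gold in examples:
        # With random labels, each demonstration's label is drawn uniformly
        # from the label space, ignoring the ground truth.
        label = rng.choice(label_space) if use_random_labels else gold
        lines.append(f"Review: {text}\nSentiment: {label}")
    # The query is left unlabeled for the model to complete.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_prompt(
    examples=[("Great movie!", "positive"), ("Terrible pacing.", "negative")],
    label_space=["positive", "negative"],
    query="Not bad at all.",
)
```

The surprising finding is that prompting with the `use_random_labels=True` variant often performs about as well as using the ground-truth labels.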

Number of [correct] labels: the effect described above seems to be consistent regardless of the choice of \(k\); see Figure 3. Experiments also show that beyond a certain \(k\), performance does not improve greatly.

Experiment results on varying k.
Fig. 3: Ablations on varying numbers of examples in the demonstrations (Min et al., 2022).

Why it works (other factors):

As previously mentioned, the authors explore other factors: label space, format, and distribution of inputs.

Figures 4, 5 and 6 show their findings with respect to distribution of inputs, format, and label space respectively.

Experiment results on input distribution.
Fig. 4: Input distribution results. Distribution of input can be directly seen by comparing the middle bars (Min et al., 2022).
Experiments results for format factor.
Fig. 5: Format results (Min et al., 2022).
Experiments on other factors.
Fig. 6: Label space can be evaluated by looking at the middle bars (Min et al., 2022).

Briefly, they find that showing the label distribution and the input distribution, even if only independently, i.e. without showing a direct mapping between them, is crucial for ICL. If one of these aspects is missing but the right format is used, performance can still remain high.

The Semi-Refutal

With NLP moving so fast, such bold claims would not go unchallenged. At the same conference where the above paper was presented, a response paper (Yoo et al., 2022) showed up, challenging and refining the experiments to show that ground-truth labels do [sometimes] matter. Figure 7 shows a case where ground-truth labels in the input clearly matter.

Experiment showing that ground-truth labels sometimes matter.
Fig. 7: Experiment showing that ground-truth labels sometimes matter (Yoo et al., 2022).

The authors of this work aim to propose more systematic metrics and experiments for analyzing this phenomenon. They use mostly the same tasks and datasets as before; additionally, they propose the following.

Label-correctness Sensitivity: we are interested not just in whether models are sensitive to label corruption, but in how sensitive they are. Thus, the authors define sensitivity as the slope of a linear model of performance as labels get corrupted:

\[y = \beta_0 + \beta_1 s\]

Where \(y\) is the obtained performance as measured by, e.g., accuracy, and \(s\) is the percentage of correctly labeled examples in the input. This makes \(\beta_1\) our sensitivity measure.
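As a quick sketch (with made-up numbers, not the paper's data), the sensitivity \(\beta_1\) can be estimated by fitting this line to (fraction-correct, performance) pairs:

```python
import numpy as np

def fit_sensitivity(s, y):
    """Fit y = beta_0 + beta_1 * s; returns (beta_0, beta_1).

    s: fraction of correctly labeled demonstrations per run
    y: measured task performance (e.g., accuracy) per run
    """
    beta_1, beta_0 = np.polyfit(s, y, deg=1)  # polyfit returns highest degree first
    return beta_0, beta_1

# Toy data lying exactly on y = 0.33 + 0.30 * s (coefficients chosen to
# loosely resemble the magnitudes reported in the paper).
s = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = 0.33 + 0.30 * s
beta_0, beta_1 = fit_sensitivity(s, y)  # beta_1 is the sensitivity
```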

Ground-truth Label Effect Ratio: not all tasks are made equal; some are easier than others for ICL. Thus, the authors define a way to measure the impact of label corruption, or rather how much ground-truth labels improve performance over random labels.

\[GLER = \frac{y_{GT} - y_{RL}}{y_{GT} - y_{\emptyset}}\]

Where \(y_{GT}\) is the performance using ground-truth labels, \(y_{RL}\) is the performance using random labels, and \(y_{\emptyset}\) is the performance in a zero-shot setting, i.e. with no examples provided in the input.

In other words, GLER measures what share of the few-shot gain over zero-shot is attributable to label correctness.
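In code form (a trivial sketch with hypothetical performance numbers), GLER is just:

```python
def gler(y_gt: float, y_rl: float, y_zero: float) -> float:
    """Ground-truth Label Effect Ratio.

    y_gt:   performance with ground-truth labels
    y_rl:   performance with random labels
    y_zero: zero-shot performance (no demonstrations)
    """
    return (y_gt - y_rl) / (y_gt - y_zero)

# Hypothetical example: random labels recover half of the few-shot gain,
# so half of the gain over zero-shot is attributable to label correctness.
ratio = gler(y_gt=0.80, y_rl=0.70, y_zero=0.60)  # 0.5
```

A GLER near 0 means labels barely matter (Min et al.'s picture); a GLER near 1 means nearly all of the few-shot gain comes from correct labels.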

Main Results

Initial results

Table 2 shows how there is a clear positive correlation, i.e. there is sensitivity towards label corruption.

| Method | Coefficient | Intercept | \(R^2\) |
|---|---|---|---|
| GPT-NeoX Direct | 0.300 | 0.327 | 0.810 |
| GPT-J Direct | 0.309 | 0.291 | 0.861 |
Table 2: Main results for regression fitted as per the sensitivity equation provided.

Additionally, Figure 8 shows sensitivity and GLER results for the different datasets and tasks.

Initial results from Yoo et al.
Fig. 8: Initial results from the response paper (Yoo et al., 2022).

Task difficulty: the authors hypothesize that sensitivity is related to task difficulty. They propose the following metric and present the results in Figure 9.

\[y_{rel} = y_{GT} - y_{baseline}\]
Relative difficulty vs sensitivity
Fig. 9: Relative difficulty against sensitivity (Yoo et al., 2022).

The authors conclude that sensitivity alone is not a powerful enough metric on its own. Something like task difficulty must also be taken into account.

When do ground-truth labels [not] matter?

The authors also suggest factors that modulate sensitivity:

  • Using the channel method of inference, as opposed to direct, may reduce sensitivity
  • Calibrating before use (Zhao et al., 2021) might also reduce sensitivity
  • To some degree, the higher your \(k\), the more sensitive your model will be
  • They also show that model size is positively correlated with sensitivity

Recent Discoveries

In line with the model-size results discussed above (Yoo et al., 2022), more recent work (Wei et al., 2023) has shown not only that larger models are more sensitive to label corruption, but also that they process ICL differently as they scale up. Emergent abilities are once again relevant here.

The main findings are as follows:

  1. Models get better at overriding semantic priors as they get larger, i.e. they’re better at “following” the input-label mapping in the input.
  2. Being able to use semantically unrelated labels is also an emergent ability. While Yoo et al. (2022) also studied this, the scale of the models they used for this particular experiment was not sufficiently varied to show the effect.
  3. Instruction-tuned models are better at making use of both semantic priors and input-label mappings; however, the former effect is stronger than the latter.

Figures 10, 11 and 12 show these results in more detail.

Overriding power contrasted with scale.
Fig. 10: Models get better at overriding semantic priors as they get larger (Wei et al., 2023).
Unrelated label usage vs scale
Fig. 11: Models get better at making use of even unrelated labels in the prompt as they grow larger (Wei et al., 2023).
Overriding power vs size for instruction-tuned models results.
Fig. 12: Instruction tuned language models are worse at overriding semantic priors (Wei et al., 2023).


Understanding in-context learning, and how and why it works, has seen a lot of progress in the past few months. Through the works discussed here, we can conclude that models are indeed sensitive to label corruption in the input. However, this does not always hold; this is particularly true for smaller language models.

Large language models are better at overriding the semantic priors they acquired during pre-training and at following the input-label mapping in the prompt. They are even capable of making use of semantically unrelated labels in the input.


  1. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 328–339.
  2. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
  3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (Vol. 33, pp. 1877–1901). Curran Associates, Inc.
  4. Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., & Zettlemoyer, L. (2022). Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 11048–11064.
  5. Yoo, K. M., Kim, J., Kim, H. J., Cho, H., Jo, H., Lee, S.-W., Lee, S.-goo, & Kim, T. (2022). Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2422–2437.
  6. Wei, J., Wei, J., Tay, Y., Tran, D., Webson, A., Lu, Y., Chen, X., Liu, H., Huang, D., Zhou, D., & others. (2023). Larger language models do in-context learning differently. ArXiv Preprint ArXiv:2303.03846.
  7. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237.
  8. von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2022). Transformers learn in-context by gradient descent. ArXiv Preprint ArXiv:2212.07677.
  9. Dai, D., Sun, Y., Dong, L., Hao, Y., Sui, Z., & Wei, F. (2022). Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta Optimizers. ArXiv Preprint ArXiv:2212.10559.
  10. Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., & Chen, W. (2022). What Makes Good In-Context Examples for GPT-3? Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 100–114.
  11. Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 3045–3059.
  12. Zhao, Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate before use: Improving few-shot performance of language models. International Conference on Machine Learning, 12697–12706.

Guatemalan Spanish feature #2

Guatemalan Spanish is filled with interesting (at least to me) features. I'll be trying to cover some of them on this blog whenever I'm bored and have nothing else to do. I hope some people find them interesting.

Feature #2: Indefinite Article + Possessive + Noun

Now, this construction doesn't seem to be exclusive to Guatemala, but it isn't prevalent anywhere outside of Central America (and maybe Southern Mexico). It has caught the eye of some linguists in the past (Martin 1985).

An example of this construction would be something like:

Mire, no me regala una mi tacita de café?

“Can I maybe have a cup of coffee?”

This is a very old construction that was very frequent between the 13th and 15th centuries (Nieuwenhuijsen 2007); however, it has only survived in the Americas. Earlier linguistic work appealed to 'convergence' between Spanish and Mayan languages, as that could explain how the feature survived in Guatemalan Spanish but not in Standard Spanish. However, studies have shown that it is prevalent regardless of whether speakers are monolingual or bilingual. Linguists have considered how the parallel construction in Mayan languages could have contributed to the construction's expansion, but this has been treated mainly as an aid to the construction's persistence rather than direct influence (Elsig 2017).

Another theory presents this as a form of refunctionalization, or exaptation: a grammatical form on the verge of disappearing is given a new meaning, assigned a new value, and survives that way (Pato 2018).

This construction can represent different semantic aspects (Pato 2018); in the example above it is used as emphatic or a marker to decrease the value of something, in this case the value of a cup of coffee.

However, it can also have a partitive value, expressing that the speaker is talking about 'one of many':

Fijate que a un mi primo lo asaltaron ayer, vos.

“One of my cousins got mugged yesterday”

This partitive usage would forbid something like una mi esposa (one typically has only one wife), but it of course allows for both animate and inanimate referents.

Additionally, there is a habitual or iterative value possibility:

Él siempre se va a tomar una su cerveza ahí antes de ir a casa

“He always has a beer there before going home”

And finally it can feature a discursive-pragmatic value and mark some form of special value when narrating a story or tale:

Eran muy pobres, tenían una su vaquita que ordeñaban… m… d‘eso vivían, de su lechita

[C. A. Lara Figueroa, Cuentos populares de encantos y sortilegios en Guatemala, 1992, p. 130, in (Palacios Alcaine 2004)]

“They were very poor, they had a cow that they milked… m… they made a living out of that, of its milk”

This is one of the more beautiful features in Guatemalan Spanish for various reasons:

  1. It’s always been present
  2. There’s no stigma attached to it (as opposed to, sadly, a lot of the more distinctive features)
  3. It’s part of every register in the language

Meta Analysis on the GradStats App

A couple of months ago I released a web app that lets people play around with historical graduate admissions data. Since then, thousands of people have used it, and I took the liberty of tracking what people search for on the app.

Note: I don’t track your personal info, IP address, or anything. I only track what is being searched for and at what time.

I wrote about the app on this blog and posted a link to it on Reddit. By now, I think enough people have used the app that some trends should be noticeable. As I'm not logging anything related to the users themselves, this analysis is gonna be pretty limited, but it should still be fun.

Major popularity distribution
Fig. 1: Major popularity distribution
Major popularity distribution for more majors
Fig. 2: Major popularity distribution for more majors

So, there's a clear trend regarding CS and ECE as majors. The majority of users came to the app from Reddit, and I do wonder if that population is heavily biased toward those majors and/or STEM in general. I will be sharing the app on Twitter and other platforms in hopes of diversifying the audience.

What degree are people shooting for?

Degree popularity distribution
Fig. 3: Degree popularity distribution

It seems like people are both shooting for PhD and MS degrees. MBA people don’t really seem to be too interested in finding out what the competition is doing. Then again, this could just be due to reddit’s biased population. GradCafe itself doesn’t have that many entries for MBA degrees, so it could just be that the sort of stats gathered on GradCafe aren’t too relevant for MBA applicants.

Where are people applying from?

As I mentioned earlier in this post, I'm not tracking information on the users themselves. However, they are able to filter GradCafe entries by status, i.e. whether a student is American, international, etc.

Status frequency distribution
Fig. 4: Status frequency distribution

It seems like the app is particularly popular for people outside the US, specifically international applicants with degrees from outside the US.

As my initial analysis showed, this is probably not due to international applicants getting results later than American applicants.

Making a plot for this would've probably been a lot tougher than for other aspects of the app. For this reason, I decided to simply create a word cloud representing how popular some schools are, i.e. which schools you guys are searching for stats on.

Word cloud for school names
Fig. 5: Word cloud for school names
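Under the hood, a word cloud is just a frequency tally mapped to font sizes. A minimal sketch with made-up search-log entries (the real log is read from disk):

```python
from collections import Counter

# Hypothetical entries from the search log, one school name per query.
searched_schools = [
    "MIT", "Stanford", "MIT", "CMU", "Berkeley", "Stanford", "MIT",
]

# Count how often each school shows up; a word-cloud library would then
# scale each school's font size by its count.
school_counts = Counter(searched_schools)
top_schools = school_counts.most_common(3)
```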

This could be because these schools are the most competitive, so most people consider them, take a look at the stats, and then decide whether to apply or not. It could also just be people checking what top-university stats look like. Of course, it could be a combination of both. Either way, these top schools are probably, once again, influenced by the bias in this app's population.

A look into resources

Memory usage throughout time
Fig. 6: Memory usage throughout time
Usage across time
Fig. 7: Usage across time

Usage has gone down as acceptances and rejections have gone out. That, combined with the fading effect of hitting the top of the subreddit, suggests the traffic for this app will be cyclical throughout each year.

Grad Stats App -- Grad Admissions Data Made Easy

Continuation of this post

TL;DR Here’s the link to the app:

Recently I wrote about GradCafe stats and, understandably, people wanna know about stats for specific schools, programs, cycles, etc.

So I built an app that lets you query the scraped data directly: Here’s a quick overview of the app in the form of a video.

You can basically query for the following fields:

  • Institution
  • Major
  • Type of degree
  • Season/Cycle
  • Status
  • and any combination of the above

You can also somewhat personalize the look of the graphs, but really the meat of the data is what's interesting here.

Motivation behind the project

The motivation behind this project was to democratize GradCafe data in a way that's easily accessible to people regardless of their background, and especially regardless of their coding skills. My previous attempt at this merely democratized the way you scrape the data; hopefully this attempt is much better suited to the diversity of graduate school applicants.

In line with my objective, the app was built using streamlit which made things really easy. The code is available on Github for anyone to check out but it’s definitely nothing impressive.

Known issues

The server running the app is pretty weak, so it probably can't tolerate a Reddit hug of death yet, but I'm trying to optimize it further so that a not-too-expensive server can handle the app just fine.

Anyway, here’s the link: Grad Stats

Why Is There No D# Major Scale

Hey great question!

Good news! Your intuition isn't wrong: there is a \(D\sharp\) major scale. It does exist, it's just… not useful. This is where enharmonics shine. I'll try to illustrate why.

The \(D\sharp\) major scale would be as follows (following the pattern):

\[D\sharp\ E\sharp\ F\sharp\sharp\ G\sharp\ A\sharp\ B\sharp\ C\sharp\sharp\ D\sharp\]

As you can see, every note has a sharp, and \(C\) and \(F\) even have double(!) sharps. There's also the fact that \(E\sharp\) is really just \(F\), but the theory dictates we notate it as \(E\sharp\) nevertheless; the same goes for \(B\sharp\), etc. Maybe you can kinda see why it could be less than ideal to deal with a scale like this.

Now, while this is an actual scale and is completely valid, reading music written in this key can be quite uncomfortable especially considering there’s a perfectly valid alternative.

What’s the enharmonic note for \(D\sharp\)? It’s \(E\flat\). Its major scale would be:

\[E\flat\ F\ G\ A\flat\ B\flat\ C\ D\ E\flat\]

Now this seems much more reasonable. We only have three flats, and reading it is completely fine. The key signature is much simpler to grasp, and it's just more straightforward to deal with \(E\flat\) instead of \(D\sharp\).
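Both spellings fall out mechanically from the major-scale pattern: each degree takes the next letter name, and the accidental is whatever makes the pitch fit. Here's a small sketch (hypothetical helper, standard pitch arithmetic) that spells out any major scale, double sharps and all:

```python
NATURAL = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
LETTERS = "CDEFGAB"
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]  # semitones from tonic: W W H W W W H

def major_scale(root):
    """Spell a major scale with correct accidentals (including double sharps)."""
    letter, accidentals = root[0], root[1:]
    offset = accidentals.count("#") - accidentals.count("b")
    tonic = NATURAL[letter] + offset
    start = LETTERS.index(letter)
    scale = []
    for degree, step in enumerate(MAJOR_STEPS):
        deg_letter = LETTERS[(start + degree) % 7]  # each degree uses the next letter
        diff = (tonic + step - NATURAL[deg_letter]) % 12
        if diff > 6:
            diff -= 12  # pick the nearer accidental (a flat, not e.g. 11 sharps)
        scale.append(deg_letter + ("#" * diff if diff >= 0 else "b" * -diff))
    return scale

major_scale("D#")  # ['D#', 'E#', 'F##', 'G#', 'A#', 'B#', 'C##']
major_scale("Eb")  # ['Eb', 'F', 'G', 'Ab', 'Bb', 'C', 'D']
```

Running it on both roots reproduces the two spellings above, making the trade-off plain: same pitches, very different amounts of ink.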

You’ll learn more about reading and key signatures later on in the course, but I hope this helped explain why some people might just say “there’s no X scale” to simplify things.

NOTE: This post was copied from a response I wrote years ago on the Coursera forums.