๐’๐ฎ๐ฉ๐ซ๐จ๐ญ๐ž๐ž๐ฆ ๐’๐š๐ซ๐ค๐š๐ซ

@SuproteemSarkar

Followers 513 · Following 0 · Media 32 · Statuses 55

Assistant Professor @UChicago | @Harvard AB/SM '19, PhD '25

Joined May 2012
Suproteem Sarkar (@SuproteemSarkar) · 9 months
I'm excited to further study questions related to finance, technology+innovation, and behavioral economics, and to extend the scope and credibility of machine learning in empirical research. You can find the full paper at
Suproteem Sarkar (@SuproteemSarkar) · 9 months
In summary: I transform economic language into embedding vectors, and show these vectors are informative of perceptions and beliefs. I train LLMs that address credibility issues with ML in empirical research. I study economic mechanisms that drive valuation and misvaluation
Suproteem Sarkar (@SuproteemSarkar) · 9 months
Valuations can fluctuate when perceptions change: is a customer service firm that adopts a language model really an "AI" firm, or just a "service" firm? I find that these perception changes relate to selective attention, firm communication, and technology transformations
Suproteem Sarkar (@SuproteemSarkar) · 9 months
Main results:
1️⃣ Embeddings explain valuations + outperform traditional characteristics
2️⃣ Returns reflect changes in how businesses are valued + changes in the perceived business model itself
3️⃣ Some changes in embeddings reflect misperceptions, which generate misvaluation
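As a rough illustration of result 1️⃣, here is a minimal sketch of regressing valuations on embedding vectors. The data and the ridge specification are stand-ins, not the paper's actual design.

```python
# Hypothetical sketch: regress (log) valuations on firm embedding vectors.
# `embeddings` and `log_valuations` are synthetic stand-ins for real data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_firms, d = 1000, 256
embeddings = rng.normal(size=(n_firms, d))           # firm embedding vectors
log_valuations = embeddings[:, :8].sum(axis=1) + rng.normal(size=n_firms)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, log_valuations, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"out-of-sample R^2: {model.score(X_test, y_test):.2f}")
```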
Suproteem Sarkar (@SuproteemSarkar) · 9 months
Embeddings also encode similarity between firms: geometric distance relates to established measures of perceived similarity. Taken together, these results demonstrate that a firm's embedding is informative of its perceived business model
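A minimal sketch of the geometry claim: similarity between two firms can be read off as the cosine of the angle between their embedding vectors. The random vectors below are placeholders for the learned embeddings.

```python
# Cosine similarity between two firm embedding vectors (toy data).
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
firm_a, firm_b = rng.normal(size=256), rng.normal(size=256)
print(f"similarity(A, B) = {cosine_similarity(firm_a, firm_b):.3f}")
```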
Suproteem Sarkar (@SuproteemSarkar) · 9 months
First, I train new language models on historical data to avoid lookahead bias. I've released the LLMs to the research community. Second, I use contrastive representation learning to construct embeddings of firms. The geometry of these vectors relates to economic features of firms
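For readers unfamiliar with contrastive representation learning, here is an illustrative InfoNCE-style objective, the standard loss behind this family of methods; the paper's actual training setup may differ.

```python
# Illustrative InfoNCE contrastive loss (not the paper's code): paired
# views of the same firm are pulled together, other firms in the batch
# act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same firms."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature      # pairwise similarities
    labels = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(info_nce(z1, z2))
```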
Suproteem Sarkar (@SuproteemSarkar) · 9 months
I transform financial news language into embedding vectors. Embeddings put quantitative structure on unstructured data, and have contributed to the success of machine learning over the past decade
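As a generic example of the text-to-vector step, the sketch below encodes two headlines with an off-the-shelf sentence-transformers model. This is a stand-in: the paper trains its own historically constrained models.

```python
# Turn short texts into embedding vectors with a generic pretrained model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
headlines = [
    "Firm launches an AI-powered customer service platform",
    "Firm reports record quarterly service revenue",
]
vectors = model.encode(headlines)   # ndarray of shape (2, 384)
print(vectors.shape)
```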
Suproteem Sarkar (@SuproteemSarkar) · 9 months
Economic valuations fluctuate in ways empirical research cannot fully explain. What information are we missing? Economic theories emphasize the role of hard-to-quantify beliefs and perceptions. My job market paper develops algorithms + measurement to quantify perceptions of firms
Suproteem Sarkar (@SuproteemSarkar) · 1 year
You can find the working paper at: Thank you to @keyonV for being an amazing coauthor. Comments are very welcome! (n/n)
Suproteem Sarkar (@SuproteemSarkar) · 1 year
The ideal solution is to select from a family of LLMs with rolling cutoffs, or "time subscripts". For example, one such model family is StoriesLM: This is only a start; there is a lot of room to further research and improve these kinds of models (8/n)
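One way to picture the "time subscript" idea: keep a lookup of models indexed by pretraining cutoff and always select one that predates the sample. The model IDs below are hypothetical placeholders, not confirmed StoriesLM release names.

```python
# Hypothetical model registry keyed by pretraining-cutoff year.
MODEL_FAMILY = {
    1950: "stories-lm-1950",   # placeholder IDs, not real release names
    1975: "stories-lm-1975",
    2000: "stories-lm-2000",
}

def model_for(sample_year: int) -> str:
    """Return the latest model whose cutoff is strictly before the sample."""
    eligible = [y for y in MODEL_FAMILY if y < sample_year]
    if not eligible:
        raise ValueError("no model predates this sample")
    return MODEL_FAMILY[max(eligible)]

print(model_for(1980))   # -> "stories-lm-1975"
```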
Suproteem Sarkar (@SuproteemSarkar) · 1 year
Instead, we argue researchers should use language models with historical pretraining cutoffs. This way, analysis that uses these models is out of sample (7/n)
Suproteem Sarkar (@SuproteemSarkar) · 1 year
How can we address lookahead bias? Two ad hoc strategies have issues. [Prompting] We include instructions not to look ahead in our main results. There is still lookahead bias. [Masking] Censoring identifying information from prompts doesn't make the prompts unidentifiable (6/n)
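A toy version of the masking strategy, showing why it falls short: even after obvious identifiers are redacted, the residual text can still pinpoint the firm and period. The prompt and patterns here are illustrative only.

```python
# Redact obvious identifiers from a prompt; the remainder can still
# identify the firm and era, so lookahead bias is not removed.
import re

prompt = "Zoom's September 2019 earnings call discussed video conferencing demand."
masked = re.sub(r"Zoom|September 2019", "[MASKED]", prompt)
print(masked)
# -> "[MASKED]'s [MASKED] earnings call discussed video conferencing demand."
# "video conferencing demand" alone still points to the firm and period.
```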
Suproteem Sarkar (@SuproteemSarkar) · 1 year
We next show that language models can be used to predict the results of "natural experiments" that are commonly considered unpredictable. We show a language model-based analysis procedure can predict the results of close U.S. House elections (5/n)
Suproteem Sarkar (@SuproteemSarkar) · 1 year
We argue a model can also leak future information indirectly. Generated risk factors for 2020 are more likely to include sequences like "disease outbreak" and "supply chain" than generated risk factors for 2019 (4/n)
Suproteem Sarkar (@SuproteemSarkar) · 1 year
We argue if a model systematically generates language sequences from the future, it directly leaks future information. We scale this exercise to 1,000 earnings calls from September 2019 to November 2019. We find 6.8% of generated risk factors for 2020 include "COVID-19" (3/n)
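The leakage test reduces to a simple frequency count: what share of generated risk factors mention a term that only entered the language later? A minimal sketch with placeholder data:

```python
# Count the share of generated texts mentioning a future-dated term.
# `generated_2020` is placeholder data standing in for the 1,000 samples.
generated_2020 = [
    "Risks include COVID-19 and supply chain disruption.",
    "Competition may pressure margins.",
    "A disease outbreak such as COVID-19 could reduce demand.",
]

def share_mentioning(texts: list[str], term: str) -> float:
    """Fraction of generated texts that contain the target term."""
    return sum(term.lower() in t.lower() for t in texts) / len(texts)

print(f"{share_mentioning(generated_2020, 'COVID-19'):.1%}")  # -> 66.7%
```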
Suproteem Sarkar (@SuproteemSarkar) · 1 year
An example: We query a language model with Zoom's earnings call from September 2019, with instructions to generate the firm's risk factors. The output includes "COVID-19," a term introduced in 2020. It also mentions "remote work," which became much more prominent in 2020 (2/n)
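Schematically, the query looks like the sketch below: pass an earnings-call transcript to a chat LLM and ask for the firm's risk factors. The client, model name, and file path are illustrative choices, not the paper's exact setup.

```python
# Schematic query (not the paper's exact prompt or model): ask a chat LLM
# to generate a firm's risk factors from its earnings-call transcript.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
transcript = open("zoom_2019q3_call.txt").read()  # placeholder path

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model; the paper's choice may differ
    messages=[
        {"role": "system", "content": "Generate this firm's risk factors."},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)
```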
Suproteem Sarkar (@SuproteemSarkar) · 1 year
Language models' pretraining data includes information about historical events. When we use LLMs to make economic predictions, do they leak information about the future? @keyonV and I develop tests that find evidence of lookahead bias in LLMs, and identify ways forward (1/n)
Suproteem Sarkar (@SuproteemSarkar) · 3 years
I cannot thank my collaborators enough for their ever-present energy, creativity, and insight throughout this process. Our dataset and example analyses: Our paper: Please take a look and let us know what you think! [n/n]
[Link card · arxiv.org: "Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although..."]
Suproteem Sarkar (@SuproteemSarkar) · 3 years
Our dataset has numerous limitations. It is restricted to English-language text, and focuses on patent applications, which are not universally accessible and do not reflect all kinds of innovation. We discuss inequities in patent access and outcomes in our paper [9/n]
Suproteem Sarkar (@SuproteemSarkar) · 3 years
The answer may lie in differences in innovation criteria across patent categories, as well as heterogeneity in technical language. Our dataset provides rich metadata on inventors and applications that can be merged with other sources to study the drivers of these gaps [8/n]
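A hypothetical sketch of what such a merge might look like with pandas; the column names are illustrative, not the dataset's actual schema.

```python
# Join inventor metadata with an outside source on a shared key.
import pandas as pd

inventors = pd.DataFrame({
    "inventor_id": [1, 2],
    "application_id": ["A1", "A2"],
    "location": ["Cambridge, MA", "Chicago, IL"],
})
census = pd.DataFrame({            # hypothetical outside data source
    "location": ["Cambridge, MA", "Chicago, IL"],
    "median_income": [112000, 71000],
})
merged = inventors.merge(census, on="location", how="left")
print(merged)
```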