Institutional Data Initiative

@instdin

Followers
145
Following
3
Media
5
Statuses
14

A research center at Harvard working to strengthen society’s connection to knowledge by advancing our access to and understanding of the data that shapes AI.

Joined August 2024
@instdin
Institutional Data Initiative
21 days
RT @leppert: This Monday, @instdin will host @petrknoth to share his experience leading CORE ("The world’s largest collection of open acces…
0
2
0
@instdin
Institutional Data Initiative
24 days
RT @leppert: Tomorrow, it's our pleasure to host @ayahbdeir to talk about the power of data in building an AI ecosystem that's open, transp…
0
2
0
@instdin
Institutional Data Initiative
28 days
We hope Institutional Books will be the beginning of a process that makes millions more books accessible to the public for a variety of uses. We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
0
0
2
@instdin
Institutional Data Initiative
28 days
We look forward to growing Institutional Books through community. We welcome collaboration from researchers and model makers as we:
- Evaluate the dataset’s impact on model outputs
- Continue to refine our OCR pipelines
View the dataset on Hugging Face:
1
0
3
@instdin
Institutional Data Initiative
28 days
As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that utilizes line detection to reassemble the text according to the line type.
1
0
0
@instdin
Institutional Data Initiative
28 days
We included extensive volume-level metadata with both original and generated components, such as results from text-level language detection.
1
0
0
@instdin
Institutional Data Initiative
28 days
We analyzed the dataset’s coverage across time, topic, and language and found:
- 40% of English text + long tail of 254 languages
- 20 clear topical tranches
- Largely published in the 19th and 20th centuries
Technical report here:
1
0
1
@instdin
Institutional Data Initiative
28 days
Today we released Institutional Books 1.0, a 242B token dataset from Harvard Library's collections, refined for accuracy and usability. 🧵
4
13
38
@instdin
Institutional Data Initiative
4 months
RT @felchang: I've loved writing words, while loops and wandering wectors, so I'm thrilled to join the @instdin team at Harvard as the dire…
0
2
0
@instdin
Institutional Data Initiative
4 months
RT @leppert: As the Institutional Data Initiative (@instdin) expands its mission, we’re announcing a collaboration with the Boston Public L…
0
7
0
@instdin
Institutional Data Initiative
4 months
RT @leppert: I'm pleased to announce we're expanding our mission at the Institutional Data Initiative (@instdin) with an open call for inst…
0
5
0
@instdin
Institutional Data Initiative
7 months
RT @leppert: Today we're launching the Institutional Data Initiative to work with libraries, gov agencies, and other knowledge institutions…
0
16
0
@instdin
Institutional Data Initiative
7 months
0
1
2
@instdin
Institutional Data Initiative
7 months
Hello world. 🧵
1
5
7