 
            
              Institutional Data Initiative
            
            @instdin
            Followers: 160, Following: 3, Media: 7, Statuses: 19
              A research center at Harvard working to strengthen society’s connection to knowledge by advancing our access to and understanding of the data that shapes AI.
              
              Joined August 2024
            
            
           What is the pathway towards greater diversity in data and AI? Hear from Professor Ruth Okediji, scholar of IP Law at Harvard Law School, who will be in conversation with Assistant Dean Amanda Watson of the Harvard Law School Library on Oct 22 at 2PM.  https://t.co/20ymDIQmVK 
          
          
                
              
             Join us tomorrow at 10AM EST:  https://t.co/FakraXOEzv 
          
           Can a small visual language model read documents as effectively as models 27 times its size? Next Friday, IDI will host Michele Dolfi and Peter Staar from @IBMResearch Zurich to discuss their work on SmolDocling, an “ultra-compact” model for diverse OCR tasks. 
            
                
              
             This Monday, @instdin will host @petrknoth to share his experience leading CORE ("The world’s largest collection of open access research papers") as the rise of AI brings new meaning, and challenges, to stewarding knowledge repositories. Join us virtually via the link below. 
          
                
              
             Tomorrow, it's our pleasure to host @ayahbdeir to talk about the power of data in building an AI ecosystem that's open, transparent, and fair. 11am ET on June 17th. Register at the link below to attend virtually. Cohosted by the @instdin and @BKCHarvard. 
          
                
              
             We hope Institutional Books will be the beginning of a process that makes millions more books accessible to the public for a variety of uses. We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.  https://t.co/gPuXtKAayI 
          
          
            
            institutional.org
              Institutional Books 1.0 is our first release of public domain books. This set was originally digitized through Harvard Library’s participation in the Google Books project.
            
                
              
             We look forward to growing Institutional Books through community. We welcome collaboration from researchers and model makers as we:
             - Evaluate the dataset’s impact on model outputs
             - Continue to refine our OCR pipelines
             View the dataset on Hugging Face: 
          
            
            huggingface.co
            
                
              
             As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that utilizes line detection to reassemble the text according to the line type. 
          
                
              
             We included extensive volume-level metadata with both original and generated components, such as results from text-level language detection. 
          
                
              
             We analyzed the dataset’s coverage across time, topic, and language and found:
             - 40% of English text + long tail of 254 languages
             - 20 clear topical tranches
             - Largely published in the 19th and 20th centuries
             Technical report here: https://t.co/Z7gvo4qCoe
          
          
                
              
             Today we released Institutional Books 1.0, a 242B token dataset from Harvard Library's collections, refined for accuracy and usability. 🧵 
          
                
              
             I've loved writing words, while loops and wandering wectors, so I'm thrilled to join the @instdin team at Harvard as the director of community and communications!  https://t.co/B6ZgWRAevG 
          
          
                
              
             As the Institutional Data Initiative (@instdin) expands its mission, we’re announcing a collaboration with the Boston Public Library (@BPLBoston) to develop AI-driven tools capable of accelerating new digitization at libraries across the world, starting at BPL. 🧵 
          
                
              
             Today we're launching the Institutional Data Initiative to work with libraries, gov agencies, and other knowledge institutions to help refine and publish their collections as data, with an eye toward AI. 🧵 
          
          
                
              
             
            