 
            
apoorv (@_apoorvnandan)
recreational coding + ml
Joined November 2016
Followers: 5K · Following: 711 · Media: 196 · Statuses: 416
            
Using categorical inputs in neural networks sounds trivial until it's user IDs for 1 billion users and your nn.Embedding layer won't work because it needs 3TB of memory. Over the weekend, I explored how TikTok engineered for this scale in their recommendation system 🧵
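A back-of-the-envelope check of that 3TB figure, plus a sketch of the hashing trick, one common workaround for huge ID spaces (the thread doesn't say this is exactly what TikTok ships; the dimension, dtype, and table size below are illustrative):

```python
import hashlib

# Rough memory math for a dense nn.Embedding over 1B user IDs,
# assuming float32 and a 768-dim embedding (both illustrative).
num_users = 1_000_000_000
dim = 768
bytes_per_float = 4
table_bytes = num_users * dim * bytes_per_float
print(table_bytes / 1e12)  # ~3.07 TB

# The hashing trick: map the huge ID space into a fixed-size table.
# Collisions between users are tolerated in exchange for bounded memory.
TABLE_SIZE = 10_000_000  # 10M rows instead of 1B (made-up size)

def hashed_row(user_id: int) -> int:
    # Deterministic hash of the ID into [0, TABLE_SIZE).
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE

row = hashed_row(987_654_321_012)
print(row)
```

With 10M rows the same table drops from ~3TB to ~30GB, at the cost of unrelated users sharing rows.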
          
                
              
             my fav use case for vibe coding is building control panels for my projects 
          
                
              
             My cursor workspace is slowly evolving into just two panes - the chat (60-70%) and the terminal (30-40%). No files, file explorer etc. 
          
                
              
             You can read the unrolled version here:  https://t.co/EYO7pjLZGN  And subscribe for more such breakdowns. P.S. I am looking for contract work, so if you need help with an ML/AI project, send me a DM. 
          
                
              
An interesting idea they mention for future work is extending VAEs to language modelling. It could potentially make the smart compose suggestions more appropriate and diverse!
          
                
              
             While performing beam search, they interpolate between the probability values given by the two models. Adding personalization improved ExactMatch scores as well as suggestion acceptance rates from the users. 
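A minimal sketch of that interpolation step: for each candidate next word, blend the global neural model's probability with the personal n-gram model's, then rank the candidates by the blended score. The weight `lam` and the toy distributions are made up:

```python
import math

def interpolated_logprob(p_neural, p_personal, lam=0.8):
    """Blend two next-word distributions (dicts word -> prob) and return
    log-probabilities for ranking beam candidates. lam is the weight on
    the global neural model (illustrative value)."""
    words = set(p_neural) | set(p_personal)
    scores = {}
    for w in words:
        p = lam * p_neural.get(w, 0.0) + (1 - lam) * p_personal.get(w, 0.0)
        if p > 0:
            scores[w] = math.log(p)
    return scores

# The personal model bumps "cheers" for a user who signs off that way.
p_nn = {"regards": 0.6, "cheers": 0.3, "thanks": 0.1}
p_user = {"cheers": 0.9, "thanks": 0.1}
scores = interpolated_logprob(p_nn, p_user, lam=0.5)
best = max(scores, key=scores.get)
print(best)  # "cheers": 0.5*0.3 + 0.5*0.9 = 0.6 beats "regards" at 0.3
```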
          
                
              
Now, let's talk about personalization: users have different writing styles, and there are billions of them, so they chose a lightweight n-gram language model, which is easier to train and requires less data. It's also stored efficiently as a compact weighted finite automaton.
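A toy version of such a personal model, trained as bigram counts over a user's sent mail. The production system stores it as a weighted finite automaton; plain nested dicts are enough for a sketch:

```python
from collections import defaultdict

def train_bigram(sentences):
    """Count-based bigram model: estimates P(next | prev) from a user's
    sent mail (toy data below)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        toks = s.lower().split()
        for prev, nxt in zip(toks, toks[1:]):
            counts[prev][nxt] += 1
    return counts

def p_next(counts, prev, word):
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

sent_mail = ["kind regards john", "best regards john", "warm regards team"]
m = train_bigram(sent_mail)
print(p_next(m, "regards", "john"))  # 2 of the 3 continuations are "john"
```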
          
                
              
Looking to balance inference quality and latency, they went ahead with Method A of feeding inputs (faster due to smaller sequence lengths) and LSTMs (lower latency at slightly lower quality).
          
                
              
They applied beam search during inference and evaluated the models with two metrics: log perplexity and ExactMatch@N. ExactMatch@N is, for predicted phrases that are N words long, the percentage of predictions that exactly match the first N words of the ground-truth text.
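A sketch of the ExactMatch@N metric as described above (my reading of the definition; the toy predictions are made up):

```python
def exact_match_at_n(predictions, references, n):
    """ExactMatch@N: over predicted phrases that are exactly n words
    long, the fraction that match the first n words of the ground truth."""
    hits, total = 0, 0
    for pred, ref in zip(predictions, references):
        pred_toks = pred.split()
        if len(pred_toks) != n:
            continue  # only phrases of length n count toward this bucket
        total += 1
        if pred_toks == ref.split()[:n]:
            hits += 1
    return hits / total if total else 0.0

preds = ["you soon", "a great", "regards"]
refs = ["you soon and take care", "a nice day ahead", "regards john"]
print(exact_match_at_n(preds, refs, n=2))  # "you soon" hits, "a great" misses -> 0.5
```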
          
                
              
Method B: You combine all the contextual information and the prefix text into one long input sequence. This is simpler, but the sequence length is longer.
          
                
              
             Second, the model: They experimented with LSTM and transformer models, and two different methods of feeding the inputs. Method A: The input sequence is the current e-mail body. The extra context is separately encoded into one embedding and combined with the input sequence. 
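The two input-feeding schemes (Methods A and B) can be sketched as token sequences; the context features and token names below are made up:

```python
# Illustrative token sequences (names and context features are invented).
context = ["<date:monday>", "<locale:en-US>", "<subject:standup>"]
body = ["hi", "team,", "the", "meeting"]

# Method B: concatenate everything into one long sequence. Simpler,
# but the model pays for the longer input on every keystroke.
method_b_input = context + body

# Method A: the sequence is just the e-mail body; the context would be
# encoded separately into a single embedding and combined with the
# sequence encoding inside the model.
method_a_input = body

print(len(method_a_input), len(method_b_input))  # 4 7
```

The shorter Method A sequence is what makes it faster at inference time, which is part of why they chose it.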
          
                
              
They replace infrequent words and entities like personal names, URLs, e-mail addresses, phone numbers, etc. with special tokens so that the model is not exposed to them. Then, they perform word-level tokenization. The vocabulary contains the most frequent 50k English words.
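That preprocessing step might look like the following sketch; the regex patterns and the tiny vocabulary are illustrative stand-ins (production uses the 50k most frequent words and covers more entity types):

```python
import re

# Illustrative patterns; a real system covers more entity types.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"\+?\d[\d\s()-]{7,}\d"), "<PHONE>"),
]

# Toy vocabulary standing in for the 50k-word production one.
VOCAB = {"please", "call", "me", "at", "or", "mail",
         "<EMAIL>", "<URL>", "<PHONE>", "<UNK>"}

def preprocess(text):
    """Replace sensitive entities with special tokens, then word-level
    tokenize against a fixed vocabulary; out-of-vocab words become <UNK>."""
    for pat, tok in PATTERNS:
        text = pat.sub(tok, text)
    return [w if w in VOCAB else "<UNK>" for w in text.split()]

print(preprocess("please call me at +1 (555) 123-4567 or mail bob@example.com"))
# ['please', 'call', 'me', 'at', '<PHONE>', 'or', 'mail', '<EMAIL>']
```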
          
                
              
First up, preparing the data. They supplement e-mail contents with extra context:
- Date and time: helps suggest good morning/evening, happy new year, etc. at the appropriate time
- Locale of the user: helps the model distinguish between en-US and en-GB spellings
          
                
              
Challenges:
- extremely low latency: inference on almost every keystroke
- personalization at large scale (1.5B users)
- privacy: the model should never expose personal information
- high-quality suggestions in subtly different contexts
          
                
              
Making Gmail’s smart compose system sounds trivial until you’re tasked with running inference for 1.5 billion users with 90th-percentile latency under 60ms and personalization based on each user’s writing style! Here’s a breakdown of how Google approached this:
          
                
              
             nano-vllm: minimal reimplementation of vllm in 1200 lines of python 
          
                
              
if you wanna learn about neural nets, this is the most important plot you need to understand
credits: @zhaisf
          
          
                