To learn how ChatGPT works, try asking her. She is reluctant to reveal some details, claiming she does not know them (see her reply about OpenAI's hardware below), but she describes a great deal else about herself. Here is an exchange I had with her:
ME: You told me that the largest version of GPT-3 had 175 billion parameters. Is each parameter stored as 1 byte? Or some other number of bits?
ChatGPT: No, each parameter in GPT-3 is not stored as 1 byte. Instead, each parameter is typically stored as a 16-bit or 32-bit floating-point number, depending on the implementation. This is because floating-point numbers can represent a wider range of values and provide greater precision than integers.
For example, GPT-3's 175 billion parameters would require around 700 GB of memory if each parameter was stored as a 32-bit floating-point number. If each parameter was stored as a 16-bit floating-point number, it would require around 350 GB of memory.
It's worth noting that the actual memory usage of GPT-3 may be different due to various optimizations and compression techniques used by the implementation.
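Her arithmetic checks out. Here is a quick back-of-the-envelope check in Python; the byte sizes are standard for 32-bit and 16-bit floats, and I am assuming "GB" means decimal gigabytes:

```python
# Back-of-the-envelope check of the memory figures ChatGPT quotes.
# Assumes "GB" means decimal gigabytes (10**9 bytes).
params = 175_000_000_000          # 175 billion parameters

bytes_fp32 = params * 4           # 32-bit float = 4 bytes per parameter
bytes_fp16 = params * 2           # 16-bit float = 2 bytes per parameter

print(f"fp32: {bytes_fp32 / 1e9:.0f} GB")   # -> 700 GB
print(f"fp16: {bytes_fp16 / 1e9:.0f} GB")   # -> 350 GB
```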
ME: With the 32-bit floating point number configuration, I assume all 700 GB are online in RAM. Not overlaid. Correct?
ChatGPT: No, it's not necessary for all 700 GB of GPT-3's parameters to be stored in RAM at the same time. In fact, it's not practical to store such a large model entirely in memory on most machines.
Most deep learning frameworks use a technique called "memory-mapping" to allow the model parameters to be loaded and unloaded from disk as needed during training or inference. Memory-mapping allows the model parameters to be accessed as if they were in memory, even though they are stored on disk.
In practice, the memory usage of GPT-3 will depend on the specific implementation and hardware configuration. Some implementations may use techniques like model parallelism or gradient checkpointing to reduce memory usage, while others may rely on larger hardware configurations with more memory.
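To make the memory-mapping idea concrete, here is a minimal sketch using NumPy's memmap. The file name, layer shape, and weight values are made up for illustration; this is the general technique she is describing, not how GPT-3 itself is stored or served:

```python
import numpy as np

shape, dtype = (4096, 4096), np.float16     # illustrative layer size

# Create a dummy weight file on disk (a stand-in for real model weights).
disk = np.memmap("layer_weights.bin", dtype=dtype, mode="w+", shape=shape)
disk[:] = 0.01
disk.flush()

# Re-open it memory-mapped and read-only, as an inference process might.
# Pages are only pulled in from disk when the corresponding slices are touched.
weights = np.memmap("layer_weights.bin", dtype=dtype, mode="r", shape=shape)
block = weights[:128, :]                    # faults in just these rows
activation = block @ np.ones(shape[1], dtype=dtype)
print(activation.shape)                     # (128,)
```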
ME: I meant in the "specialized computing infrastructure provided by OpenAI" that you described earlier. Is that all online?
ChatGPT: Ah, I see what you're asking now. Yes, in the specialized computing infrastructure provided by OpenAI, the GPT-3 model with 175 billion parameters is fully loaded into the RAM of the computing cluster during operation. The computing cluster used by OpenAI is composed of many machines with large amounts of RAM, and the model parameters are distributed across those machines.
However, it's worth noting that not all of the machines in the cluster will necessarily have a full copy of the model parameters in memory at all times. Instead, the model parameters may be partitioned across the machines and loaded into memory as needed for different parts of the computation. This approach allows the computing cluster to handle the large memory requirements of the GPT-3 model while also being scalable and efficient.
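The partitioning she describes is usually called model (or tensor) parallelism. The sketch below fakes it with plain NumPy on one machine, splitting a weight matrix column-wise across two "workers"; in a real cluster each shard would sit in a different machine's RAM or accelerator memory, and the final concatenation would be a collective communication step:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 1024, 4096                       # illustrative layer sizes
W = rng.standard_normal((d_in, d_out)).astype(np.float32)
x = rng.standard_normal(d_in).astype(np.float32)

# Column-wise shards: each "worker" holds only half of the parameters.
shards = np.split(W, 2, axis=1)                # two (d_in, d_out/2) pieces

# Each worker computes its slice of the output independently...
partial_outputs = [x @ shard for shard in shards]

# ...and the slices are concatenated (an all-gather in a real cluster).
y_parallel = np.concatenate(partial_outputs)

assert np.allclose(y_parallel, x @ W, atol=1e-4)
```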
ME: I realize that memory mapping is very fast with today's peripherals, so even an overlay configuration will still be quick.
ChatGPT: Yes, memory mapping can be very fast with modern storage devices and is commonly used to efficiently access large datasets that cannot fit entirely in memory. With memory mapping, the operating system maps a portion of a file into the virtual address space of a process, allowing the process to access the file's contents as if they were in memory.
However, it's important to note that memory-mapped I/O does have some overhead, such as page faults and disk I/O. Page faults can occur if the requested memory page is not in physical memory and must be fetched from disk, while disk I/O can occur if the requested page has not yet been read into the page cache. These overheads can be minimized through various techniques such as preloading data into the page cache, using large pages, and optimizing disk access patterns.
In any case, the actual performance of memory-mapped I/O depends on the specific hardware and software configuration, as well as the characteristics of the workload being performed.
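The page-fault and page-cache behavior she mentions can be seen with Python's mmap module. A minimal sketch, reusing the dummy weight file from the earlier example and assuming a Unix-like system (madvise is not available everywhere, so it is guarded):

```python
import mmap

path = "layer_weights.bin"          # the file created in the earlier sketch

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Hint to the kernel that we will need these pages soon, so it can
    # start reading them into the page cache before we touch them.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED)

    # Touching one byte per page pulls the rest in; each touch of a page
    # not yet in the page cache costs a page fault and possibly disk I/O.
    page = mmap.PAGESIZE
    checksum = sum(mm[i] for i in range(0, len(mm), page))
    print(f"touched {len(mm) // page} pages, checksum {checksum}")

    mm.close()
```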
ME: One more question about this. I read that the "specialized computing infrastructure provided by OpenAI" is an MPP configuration. Approximately how many processors are there? How many processors per parameter? I guess one GPU or TPU is equivalent to one CPU processor in a traditional MPP configuration.
ChatGPT: I'm sorry, but I don't have access to detailed information about the specific hardware and configuration of OpenAI's computing infrastructure, so I can't provide specific numbers about the number of processors or GPUs/TPUs used per parameter.
However, I can provide some general information about MPP (Massively Parallel Processing) configurations and their relationship to GPUs/TPUs.
In an MPP configuration, the processing power of a system is distributed across multiple processors or nodes, allowing for massive parallelism and high throughput. This is typically achieved through the use of specialized hardware and software, such as high-speed interconnects and parallel file systems.
In the context of deep learning, GPUs and TPUs are often used as the primary processing units in MPP configurations due to their ability to perform large-scale matrix operations in parallel. GPUs and TPUs are designed with a large number of processing cores that can perform these operations simultaneously, making them well-suited for deep learning workloads.
The number of processors or GPUs/TPUs used per parameter will depend on various factors, such as the size of the model, the complexity of the workload, and the hardware configuration. In general, larger models and more complex workloads will require more processing power per parameter, while smaller models and simpler workloads may be able to make do with less.
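Although she will not give numbers, a rough lower bound follows from memory alone. The sketch below assumes 16-bit weights and an 80 GB accelerator (the capacity of one current high-end GPU); activations, attention caches, and any optimizer state would push the real count higher:

```python
import math

params = 175_000_000_000            # GPT-3 parameter count
bytes_per_param = 2                 # assuming 16-bit weights
gpu_memory_gb = 80                  # assumed per-accelerator memory

weight_gb = params * bytes_per_param / 1e9
min_gpus = math.ceil(weight_gb / gpu_memory_gb)

print(f"weights: {weight_gb:.0f} GB")                        # 350 GB
print(f"minimum accelerators just to hold them: {min_gpus}")  # 5
```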
ME: So, a larger model actually needs a higher ratio of GPU or TPU units per parameter? A model is "larger" in the sense that it does more processing for each parameter, not that it has more parameters. That is interesting.
ChatGPT: Yes, you're correct. The number of processing units, whether CPUs, GPUs or TPUs, required per parameter in a deep learning model depends not only on the number of parameters but also on the complexity of the computation required for each parameter.
Larger models, in terms of the computation required for each parameter, will require more processing units per parameter, as the computation becomes more computationally intensive. This is especially true for models that require more complex operations, such as recurrent neural networks, convolutional neural networks, and transformers like GPT-3.
In general, the number of processing units required per parameter is proportional to the computational requirements of the model. Therefore, larger models require a higher ratio of processing units per parameter, which means that more powerful hardware configurations, such as clusters of GPUs or TPUs, are needed to train these models or run inference with them efficiently.
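Her point about compute per parameter can be put in numbers. The factors below are assumptions chosen purely for illustration: the per-parameter cost per generated token varies with the architecture, and the serving target of 50 tokens per second is invented. What the sketch shows is the proportionality she is claiming, namely that doubling the work done per parameter doubles the sustained compute required at a fixed throughput:

```python
params = 175_000_000_000
tokens_per_second = 50               # assumed serving target

# Assumed compute cost per parameter per token; a model that does more
# processing for each parameter corresponds to a larger factor.
for flops_per_param in (2, 4, 8):
    total_flops_per_s = params * flops_per_param * tokens_per_second
    print(f"{flops_per_param} FLOPs/param/token -> "
          f"{total_flops_per_s / 1e12:.0f} TFLOPS sustained")
```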