A Day in the Life of a Data Scientist: Conquer the Machine Learning Lifecycle on Kubernetes
GPU or CPU nodes
• Massive scale
• OpenAI dedicates up to 10k cores for a single experiment
• Autoscaling capabilities: pay for what you use, scale down when idle
• Parallel training instead of sequential: spin up pods for each variation of hyperparameters
• One centralized TensorBoard instance
• Autoscaling will create / remove VMs as needed to save cost

Demo: Create End-to-End ML Pipelines with Argo

Distributed File Systems
• NFS
• HDFS
• …

Classic DevOps solutions:
• Containers
• CI/CD
• Autoscaling
• A/B testing and canary release of models
• Comparing production accuracy vs. expected accuracy
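
The "spin up pods for each variation of hyperparameters" idea can be sketched roughly as follows. This is a minimal illustration, not the deck's actual demo: it only builds one Kubernetes Job manifest per hyperparameter combination as a plain Python dict. The function name `build_training_jobs`, the image `example.com/trainer:latest`, the flag names, and the `/mnt/logs` directory are all hypothetical; the shared log directory is assumed to live on a distributed file system (such as NFS) so a single TensorBoard instance could read every run.

```python
import itertools
import json


def build_training_jobs(grid, image="example.com/trainer:latest", log_dir="/mnt/logs"):
    """Return one Kubernetes Job manifest (as a dict) per hyperparameter combination.

    `grid` maps a hyperparameter name to the list of values to try.
    The image name and log directory are placeholders, not values from the slides.
    """
    names = sorted(grid)  # fixed order so job numbering is reproducible
    jobs = []
    for index, values in enumerate(itertools.product(*(grid[n] for n in names))):
        run_name = f"train-{index}"
        # Hypothetical CLI flags understood by the (assumed) training container.
        args = [f"--{name}={value}" for name, value in zip(names, values)]
        # Each run logs to its own subdirectory of the shared log directory.
        args.append(f"--log-dir={log_dir}/{run_name}")
        jobs.append({
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": run_name, "labels": {"app": "hparam-sweep"}},
            "spec": {
                "backoffLimit": 0,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {"name": "trainer", "image": image, "args": args}
                        ],
                    }
                },
            },
        })
    return jobs


if __name__ == "__main__":
    grid = {"learning-rate": [0.01, 0.001], "batch-size": [32, 64]}
    for job in build_training_jobs(grid):
        # One JSON document per line; JSON is also valid YAML for kubectl.
        print(json.dumps(job))
```

Because the Jobs are independent, the cluster can schedule them side by side, and a cluster autoscaler can add nodes while they run and remove them once the sweep finishes. Volume mounts for the shared log directory are left out to keep the sketch short.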