Epoch is a student initiative to restore and manage a powerful, previously abandoned computing cluster and make advanced computing resources accessible for machine learning and other compute-intensive projects. Its goals are to revive the cluster, upgrade its capabilities, and teach students to use it effectively for research and learning.
I launched the Epoch AI initiative in 2024, drawing inspiration from the original Epoch team, which was discontinued in 2022. As the founding President, I led efforts to restore the computing cluster and make it accessible and useful for students. I now serve on the alumni board, continuing to support the project’s long-term success.
The Epoch AI Cluster
You can learn more about Epoch through the website: https://epochml.org
Slurm
We use Slurm as our job scheduler. Users submit jobs to a queue, and Slurm allocates the requested resources, such as CPU cores, memory, and GPUs, and runs each job once those resources become available.
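To give a sense of what a job submission looks like, here is a minimal Python sketch that generates a Slurm batch script. The function name `render_batch_script`, the default resource values, and the `train.py` command are hypothetical and do not reflect Epoch's actual configuration; the `#SBATCH` directives themselves are standard Slurm options.

```python
def render_batch_script(job_name, command, gpus=1, cpus=4, mem_gb=16,
                        time_limit="01:00:00"):
    """Return the text of a single-node Slurm batch script.

    All defaults are illustrative assumptions, not the cluster's real limits.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",          # number of GPUs requested
        f"#SBATCH --cpus-per-task={cpus}",     # CPU cores for the task
        f"#SBATCH --mem={mem_gb}G",            # memory per node
        f"#SBATCH --time={time_limit}",        # wall-clock limit (HH:MM:SS)
        f"#SBATCH --output={job_name}-%j.out", # %j expands to the job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: one GPU, running a training script.
    print(render_batch_script("train-demo", "python train.py"))
```

Saving the output to a file such as `train.sh` and running `sbatch train.sh` would queue the job, and `squeue -u $USER` would show its status.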
Proxmox
Proxmox provides virtualization and container management for the cluster, allowing us to run both virtual machines and lightweight containers. This setup enhances flexibility, isolation, and resource utilization across diverse workloads.
Hardware
The cluster is built on servers equipped with Intel Xeon CPUs for robust parallel processing and NVIDIA GPUs for accelerated machine learning and scientific computing tasks.
Docker
Docker is used to containerize applications and environments, ensuring consistency and portability for user workloads. This simplifies dependency management and deployment across the cluster.
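As an illustration of running a containerized workload, the following Python sketch assembles a `docker run` command. The helper `docker_run_args`, the image name, and the paths are hypothetical examples rather than Epoch's actual setup, and the `--gpus` flag assumes the NVIDIA Container Toolkit is installed on the host.

```python
import shlex


def docker_run_args(image, command, workdir_host,
                    workdir_container="/workspace", gpus="all"):
    """Build the argument list for a one-off `docker run` invocation."""
    args = ["docker", "run", "--rm"]  # --rm removes the container on exit
    if gpus:
        # Assumes the NVIDIA Container Toolkit is available on the host.
        args += ["--gpus", gpus]
    # Mount the project directory and make it the working directory.
    args += ["-v", f"{workdir_host}:{workdir_container}",
             "-w", workdir_container, image]
    args += list(command)
    return args


if __name__ == "__main__":
    # Hypothetical image and project path, for illustration only.
    args = docker_run_args("pytorch/pytorch:latest",
                           ["python", "train.py"],
                           "/home/student/project")
    print(shlex.join(args))
```

Building the command as a list rather than a single string means it can be passed directly to `subprocess.run` without shell quoting problems.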
Kubernetes
Kubernetes is deployed to orchestrate Docker containers at scale, enabling automated deployment, scaling, and management of containerized applications across the cluster.
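For a concrete picture of what a containerized job can look like in Kubernetes, here is a minimal Python sketch that builds a `batch/v1` Job manifest requesting a GPU. The function `gpu_job_manifest`, the job name, and the image are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed in the cluster; this is not Epoch's actual deployment configuration.

```python
import json


def gpu_job_manifest(name, image, command, gpus=1):
    """Return a Kubernetes Job manifest (as a dict) for a one-off GPU task."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": list(command),
                        # Assumes the NVIDIA device plugin exposes this resource.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    manifest = gpu_job_manifest("train-demo", "pytorch/pytorch:latest",
                                ["python", "train.py"])
    print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be saved to `job.json` and submitted with `kubectl apply -f job.json`.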
Faces blurred for privacy.
I’m truly grateful to have had the opportunity to build this team, and I hope it will continue supporting students with their machine learning needs for many years ahead. I’m excited to see how it grows and evolves in the future.