from http://docs.continuum.io/anaconda-cluster/examples/spark-caffe
Deep Learning (Spark, Caffe, GPU)
Description
To demonstrate running a distributed PySpark job that uses GPUs, this example uses the neural network library Caffe. The example is deliberately trivial: each Spark task runs the same Caffe training script, so the work is redundant, but it shows that neural networks can be trained on GPUs from within a Spark cluster.
For this example, we recommend the AMI ami-2cbf3e44 and the instance type g2.2xlarge. An example profile (to be placed in ~/.acluster/profiles.d/gpu_profile.yaml) is shown below:
name: gpu_profile
node_id: ami-2cbf3e44 # Ubuntu 14.04 - IS HVM - Cuda 6.5
user: ubuntu
node_type: g2.2xlarge
num_nodes: 3
provider: aws
plugins:
- spark-yarn
- notebook
Installation
The Spark + YARN plugin can be installed on the cluster using the following command:
$ acluster install spark-yarn
Once the Spark + YARN plugin is installed, you can view the YARN UI in your browser using the following command:
Dependencies
First, we need to bootstrap Caffe and its dependencies on all of the nodes. We provide a bash script, bootstrap-caffe.sh, that installs Caffe from source. The following command uploads the bootstrap-caffe.sh script to all of the nodes and executes it in parallel:
$ acluster submit bootstrap-caffe.sh --all
After a few minutes, Caffe and its dependencies will be installed on the cluster nodes and the job can be started.
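Before starting the job, it can help to confirm that the bootstrap actually left the expected files on a node. The following is a minimal sketch, not part of the original example: the function name check_caffe_node is hypothetical, and the default paths are simply the ones used by the job script in this example (/home/ubuntu/caffe and /usr/local/cuda/bin).

```python
import os
import socket


def check_caffe_node(caffe_dir='/home/ubuntu/caffe',
                     cuda_bin='/usr/local/cuda/bin'):
    """Report whether the paths the job relies on exist on this host.

    Hypothetical helper: the default paths are assumptions taken from
    the job script in this example, not something Caffe guarantees.
    """
    train_script = os.path.join(caffe_dir, 'examples', 'mnist',
                                'train_lenet.sh')
    return {
        'host': socket.gethostname(),
        'caffe_dir': os.path.isdir(caffe_dir),
        'train_script': os.path.isfile(train_script),
        'nvcc': os.path.isfile(os.path.join(cuda_bin, 'nvcc')),
    }


if __name__ == '__main__':
    print(check_caffe_node())
```

Run directly on a node, this prints one dictionary for that host. Inside a PySpark job it could be mapped over an RDD, for example rdd.map(lambda _: check_caffe_node()), to gather a report from whichever hosts receive a task.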
Running the Job
Here is the complete script to run the Spark + GPU with Caffe example in PySpark:
# spark-caffe.py
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('spark-caffe')
sc = SparkContext(conf=conf)


def noop(x):
    # Return the name of the host that ran this task.
    import socket
    return socket.gethostname()


# Sanity check: list the hosts that pick up tasks.
rdd = sc.parallelize(range(2), 2)
hosts = rdd.map(noop).distinct().collect()
print(hosts)


def caffe_process(x):
    # Make the CUDA tools and the CUDA/LMDB libraries visible to the training script.
    import os
    os.environ['PATH'] = '/usr/local/cuda/bin' + ':' + os.environ['PATH']
    os.environ['LD_LIBRARY_PATH'] = '/usr/local/cuda/lib64:/home/ubuntu/pombredanne-https-gitorious.org-mdb-mdb.git-9cc04f604f80/libraries/liblmdb'
    # Run the MNIST LeNet training script and capture its output.
    import subprocess
    proc = subprocess.Popen('cd /home/ubuntu/caffe && bash ./examples/mnist/train_lenet.sh', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err


# Run one training process per partition and collect (returncode, stdout, stderr).
rdd = sc.parallelize(range(2), 2)
ret = rdd.map(caffe_process).distinct().collect()
print(ret)
You can submit the script to the Spark cluster using the submit command:
$ acluster submit spark-caffe.py
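The script prints the raw list of (returncode, stdout, stderr) tuples, which can be long and hard to read. As a rough sketch of how that list might be condensed, the helper below keeps only the exit status and the last few lines of stderr per task. The name summarize_results and the tail_lines parameter are hypothetical, and the focus on stderr assumes that Caffe writes its training log there, which is typical but worth confirming on your installation.

```python
def summarize_results(results, tail_lines=5):
    """Condense (returncode, stdout, stderr) tuples into short reports.

    Hypothetical helper, not part of the original example.
    """
    reports = []
    for returncode, _out, err in results:
        # Pipes from subprocess yield bytes; decode them for display.
        if isinstance(err, bytes):
            err = err.decode('utf-8', 'replace')
        reports.append({
            'ok': returncode == 0,
            'returncode': returncode,
            'log_tail': err.splitlines()[-tail_lines:],
        })
    return reports
```

In spark-caffe.py, this could replace the final print with a loop over summarize_results(ret), printing one short report per task.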
After the script completes, the trained Caffe model can be found at /home/ubuntu/caffe/examples/mnist/lenet_iter_10000.caffemodel on all of the compute nodes.
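One way to confirm that the model file was written is to map a small check over the cluster, in the same style as the noop function in the script. This is only a sketch: model_status is a hypothetical name, and because Spark decides where tasks run, the check reports only on the hosts that happen to receive a task.

```python
import os
import socket

# Model path as stated in this example.
MODEL_PATH = '/home/ubuntu/caffe/examples/mnist/lenet_iter_10000.caffemodel'


def model_status(_, path=MODEL_PATH):
    """Return (hostname, file_exists, size_in_bytes) for the model file.

    Hypothetical helper; the first argument is the ignored RDD element.
    """
    exists = os.path.isfile(path)
    size = os.path.getsize(path) if exists else 0
    return socket.gethostname(), exists, size
```

With the SparkContext from spark-caffe.py, a call such as sc.parallelize(range(2), 2).map(model_status).distinct().collect() would return one tuple per host that ran the check.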
posted on 2015-10-14 17:25
爬