├── Graphs
│   └── thrift-architecture.png
├── markdown
│   ├── 9-AllxuioBugAndFixBug.md
│   ├── 1-Build-And-Deploy.md
│   ├── 2-HowToUseAlluxio.md
│   ├── 4-AlluxioBlockWrite.md
│   └── 3-AlluxioRPC.md
└── README.md

--------------------------------------------------------------------------------
/Graphs/thrift-architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gjhkael/Alluxio-Internal/HEAD/Graphs/thrift-architecture.png

--------------------------------------------------------------------------------
/markdown/9-AllxuioBugAndFixBug.md:
--------------------------------------------------------------------------------

# Alluxio bugs and fixes

### 1. Hive on Alluxio permission problems
If Hive uses Kerberos to secure its data, Hive on Alluxio runs into all sorts of permission problems. This was resolved in 1.4.0; see [Hive permission issue fix](https://github.com/Alluxio/alluxio/pull/4453).

### Spark on YARN on Alluxio permission problems
Once the first issue is fixed, this one goes away with it.

### 2. Alluxio data security
Alluxio currently manages data with SIMPLE authentication: a user can set alluxio.security.login.username to any user and thereby operate on that user's data. I may add Kerberos-based authentication to Alluxio later; if you are interested, contact me and we can build it together.

### 3. Alluxio replication
Alluxio keeps only one copy of each block in memory, so for MapReduce on Alluxio there are no replicas to schedule against, and locality is worse than MR on HDFS.
See this [mailing list](https://groups.google.com/forum/?fromgroups=#!topic/alluxio-users/Jmz_DmVLVjU) thread.

### 4. Directory-level TTLs in Alluxio
This ships in 1.5.0; see [Add ttl directory function](https://github.com/Alluxio/alluxio/pull/4458).

### 5. An Alluxio makeConsistency command
A PR is open for this; see [Make consistency](https://github.com/Alluxio/alluxio/pull/4686).

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Alluxio-Internal
After studying Alluxio for a while, I am writing this repository both as a personal learning record and because there does not yet seem to be a reasonably complete, in-depth analysis of Alluxio anywhere. It walks through the Alluxio source code systematically and also records the pitfalls I hit in production and how I dug out of them. The source analyzed is Alluxio 1.4.0.

## Overview

Alluxio and HDFS are similar in many ways. Both are distributed file systems, but HDFS stores data on disk while Alluxio stores it in memory. HDFS gets fault tolerance from replicas; Alluxio relies on lineage (still experimental and incomplete, so important data should be persisted to the under file system). Both expose similar file APIs and similar shell commands (Alluxio does not yet separate admin from non-admin commands). Both store data as blocks, both use the classic master/slave architecture, and both offer master HA. They also differ in many details: Alluxio uses Thrift for RPC while HDFS uses protobuf; Alluxio is mostly used to accelerate compute frameworks on top of it, while HDFS is mostly used for persistent storage; and the locality Alluxio offers those frameworks is weaker than HDFS's.

## Contents
This analysis of Alluxio approaches it from the following angles.

1. [Build And Deploy](https://github.com/gjhkael/Alluxio-Internal/blob/master/markdown/1-Build-And-Deploy.md) building and deploying Alluxio
2. [How to use alluxio](https://github.com/gjhkael/Alluxio-Internal/blob/master/markdown/2-HowToUseAlluxio.md) using Alluxio
3. [Alluxio RPC](https://github.com/gjhkael/Alluxio-Internal/blob/master/markdown/3-AlluxioRPC.md) Thrift, the layer under Alluxio RPC
4. [Alluxio Block](https://github.com/gjhkael/Alluxio-Internal/blob/master/markdown/4-AlluxioBlockWrite.md) Alluxio block storage and management
5. [Alluxio Client](https://github.com/gjhkael/Alluxio-Internal/blob/master/markdown/1-Build-And-Deploy.md) Alluxio client source analysis
6. [Alluxio Master](https://github.com/gjhkael/Alluxio-Internal/blob/master/Build-And-Deploy.md) Alluxio master source analysis (not yet written)
7. [Alluxio Worker](https://github.com/gjhkael/Alluxio-Internal/blob/master/Build-And-Deploy.md) Alluxio worker source analysis (not yet written)
8. [Alluxio security](https://github.com/gjhkael/Alluxio-Internal/blob/master/Build-And-Deploy.md) Alluxio authentication and authorization source analysis (not yet written)
9. [Alluxio bug and fix bug](https://github.com/gjhkael/Alluxio-Internal/blob/master/markdown/9-AllxuioBugAndFixBug.md) pitfalls and how to fill them
10. [Alluxio + kerberos](https://github.com/gjhkael/Alluxio-Internal/blob/master/Build-And-Deploy.md) (not yet written)
--------------------------------------------------------------------------------
/markdown/1-Build-And-Deploy.md:
--------------------------------------------------------------------------------

# Build-and-install-Alluxio

## Building Alluxio

Building Alluxio is simple:
```
git clone https://github.com/Alluxio/alluxio.git
git checkout -t origin/branch-1.4
mvn clean package -Pspark -Dhadoop.version=2.6.0 -DskipTests
```
Notes:

* For a multi-branch open-source project like this, run git branch -a first to list all branches, then use checkout -t to track the remote branch you want.
* If you modify the Alluxio source, after rebuilding you only need to replace the Alluxio jar under assembly on the running cluster; there is no need to redeploy the whole Alluxio tree. (Tip: starting the cluster with alluxio-start.sh all NoMount preserves the data already in cluster memory.)

## Deploying Alluxio

* Install ssh and set up passwordless login from the master to the workers. (Note: Alluxio itself does not need passwordless ssh to work; it just makes remote file copies convenient (e.g. with pssh), and alluxio-start.sh all uses the worker hosts listed under conf to start the worker processes remotely.)
* Configure Alluxio as follows.

First create a directory to hold Alluxio's memory store:
```sh
sudo mount -t ramfs -o size=100G ramfs /opt/ramfs/ramdisk
sudo chown op1:op1 /opt/ramfs/ramdisk
```

```sh
#alluxio-env.sh
export ALLUXIO_HOME=/path/to/your/alluxio
export JAVA_HOME=/path/to/your/java
export ALLUXIO_MASTER_HOSTNAME=<master hostname>
export ALLUXIO_LOGS_DIR=/opt/log/alluxio
export ALLUXIO_RAM_FOLDER=/opt/ramfs/ramdisk
export ALLUXIO_WORKER_MEMORY_SIZE=30GB
```

```sh
#alluxio-site.properties
alluxio.security.authorization.permission.supergroup=hadoop
alluxio.security.authentication.type=SIMPLE
alluxio.security.authorization.permission.enabled=true
alluxio.security.login.username=gl

alluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy
alluxio.keyvalue.enabled=true

alluxio.network.thrift.frame.size.bytes.max=1024MB
alluxio.user.file.readtype.default=CACHE_PROMOTE
alluxio.user.file.writetype.default=MUST_CACHE

#Tiered Storage
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/opt/ramfs/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=30GB
alluxio.worker.tieredstore.level0.reserved.ratio=0.2
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.path=/opt/data/alluxiodata
alluxio.worker.tieredstore.level1.dirs.quota=2TB
alluxio.worker.tieredstore.level1.reserved.ratio=0.1
```
This configuration already covers Alluxio's security settings; with it on the master, the user named by alluxio.security.login.username becomes Alluxio's superuser. alluxio.user.file.write.location.policy.class picks the block placement policy for file writes, here RoundRobin, one of the four built-in policies; Alluxio's block placement has serious shortcomings, which the source-analysis chapters point out. The default read and write types for user files are also set here and are discussed in detail in the source analysis. Finally, two storage tiers are configured, so when memory runs low data spills to the slower lower tier, driven by a block eviction algorithm we will come back to.

* Copy the configured Alluxio to every worker node with scp, e.g. with a loop like the sketch below.
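A minimal sketch of that copy step, assuming the worker hostnames are listed one per line in conf/workers and Alluxio lives under the same path on every node (both paths are assumptions; adjust to your layout):
```sh
# Hypothetical helper: push the configured conf/ directory to every worker.
for host in $(cat /opt/app/alluxio/conf/workers); do
  scp -r /opt/app/alluxio/conf "${host}":/opt/app/alluxio/
done
```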
## Alluxio on secure HDFS (assuming Hadoop is already set up; I use CDH hadoop2.6.0-cdh5.7.1)
Add the following to alluxio-env.sh:
```sh
export ALLUXIO_UNDERFS_ADDRESS=hdfs://masterhostname:port
```
Add the following to alluxio-site.properties:
```sh
alluxio.master.keytab.file=/path/to/your/keytab
alluxio.master.principal=<principal in the keytab>
alluxio.worker.keytab.file=/path/to/your/keytab
alluxio.worker.principal=<principal in the keytab>
```

## MapReduce on Alluxio
* Add the following to Hadoop's core-site.xml (on every node):

```xml
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
  <property>
    <name>fs.alluxio-ft.impl</name>
    <value>alluxio.hadoop.FaultTolerantFileSystem</value>
    <description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.alluxio.impl</name>
    <value>alluxio.hadoop.AlluxioFileSystem</value>
    <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
  </property>
</configuration>
```
* Copy the Alluxio client jar onto the Hadoop classpath:

```
cp alluxio/core/client/target/alluxio-core-client-1.4.0-jar-with-dependencies.jar /path/to/your/hadoop/lib/
```

* Copy test data into Alluxio:

```
./bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt
```

* Run an MR job to verify:

```
bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.1.jar wordcount alluxio://<your master ip>:19998/wordcount/input.txt alluxio://<your master ip>:19998/wordcount/output
```

## Spark on Alluxio
* Add to spark-default.conf:

```
spark.executor.extraClassPath /opt/app/alluxio/core/client/target/alluxio-core-client-1.4.0-jar-with-dependencies.jar
spark.driver.extraClassPath /opt/app/alluxio/core/client/target/alluxio-core-client-1.4.0-jar-with-dependencies.jar
```

* If spark-env.sh already defines a Spark classpath variable, it conflicts with the settings above; the workaround is to pass the jar to spark-shell or spark-submit via --jars /opt/app/alluxio/core/client/target/alluxio-core-client-1.4.0-jar-with-dependencies.jar

* Create a core-site.xml under the Spark conf directory with the following settings:

```xml
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
  <property>
    <name>fs.alluxio-ft.impl</name>
    <value>alluxio.hadoop.FaultTolerantFileSystem</value>
    <description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.alluxio.impl</name>
    <value>alluxio.hadoop.AlluxioFileSystem</value>
    <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
  </property>
</configuration>
```
* Verify Spark:

```
bin/spark-shell --master yarn-client --jars /opt/app/alluxio/core/client/target/alluxio-core-client-1.4.0-jar-with-dependencies.jar
val s = sc.textFile("alluxio://<master ip>:19998/tmp/alluxio/test.txt")
val double = s.map(line => line + line)
double.saveAsTextFile("alluxio://<master ip>:19998/LICENSE2")
```

## Hive on Alluxio
* Hive on Alluxio still has many problems; I recommend against it until it matures.
* Briefly, to configure it anyway: copy the Alluxio client jar into hive/lib, then add the following to hive-site.xml on the hive-metastore node:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>alluxio://your-master-ip:19998</value>
  </property>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
  <property>
    <name>fs.alluxio-ft.impl</name>
    <value>alluxio.hadoop.FaultTolerantFileSystem</value>
    <description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.alluxio.impl</name>
    <value>alluxio.hadoop.AlluxioFileSystem</value>
    <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
  </property>
  <property>
    <name>alluxio.user.file.writetype.default</name>
    <value>MUST_CACHE</value>
  </property>
</configuration>
```

## Alluxio HA

* Alluxio HA still has some bugs; see [JIRA](https://alluxio.atlassian.net/browse/ALLUXIO-2439?filter=-4)
* Detailed configuration to follow; a rough sketch of the usual settings is below.
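Until that write-up exists, here is a hedged sketch of the properties fault-tolerant mode usually involves (ZooKeeper coordination plus a journal on shared storage; names and values are from memory, so verify them against your version's docs):
```sh
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=zk1:2181,zk2:2181,zk3:2181
alluxio.master.journal.folder=hdfs://namenode:8020/alluxio/journal
```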
--------------------------------------------------------------------------------
/markdown/2-HowToUseAlluxio.md:
--------------------------------------------------------------------------------

# How to use Alluxio

## Alluxio Command

- alluxio protoGen: after modifying Alluxio's protobuf files, run this to regenerate the protobuf serialization code
- alluxio thriftGen: after modifying Alluxio's thrift files, run this to regenerate the RPC Java code
- alluxio fs: operates on files in Alluxio
- alluxio fs mount /alluxio/path hdfs://xxx/path
- alluxio fs setTtl: sets a file's TTL. 1.4 does not yet support TTLs on directories; master has merged my patch for that, so you can cherry-pick [PR (click here)](https://github.com/Alluxio/alluxio/pull/4458) from master into your own Alluxio source
- alluxio fs checkConsistency: reports files on which the under file system and Alluxio disagree. The current 1.4.0 release has no command to repair the inconsistency; see my PR [Make consistency](https://github.com/Alluxio/alluxio/pull/4686)
- For everything else, run alluxio fs to see what each subcommand does

## Alluxio API
Alluxio wraps the HDFS API, so Alluxio files can be read and written in two ways:
### alluxio
```java
public Boolean call() throws Exception {
  Configuration.set(PropertyKey.MASTER_HOSTNAME, "10.2.4.192");
  Configuration.set(PropertyKey.MASTER_RPC_PORT, Integer.toString(19998));
  FileSystem fs = FileSystem.Factory.get();
  Configuration.set(PropertyKey.SECURITY_LOGIN_USERNAME, "hdfs");
  System.out.println(Configuration.get(PropertyKey.MASTER_ADDRESS));
  writeFile(fs);
  return readFile(fs);
  //writeFileWithAbstractFileSystem();
  //return true;
}

private void writeFile(FileSystem fileSystem)
    throws IOException, AlluxioException {
  ByteBuffer buf = ByteBuffer.allocate(NUMBERS * 4);
  buf.order(ByteOrder.nativeOrder());
  for (int k = 0; k < NUMBERS; k++) {
    buf.putInt(k);
  }
  LOG.debug("Writing data...");
  long startTimeMs = CommonUtils.getCurrentMs();
  if (fileSystem.exists(mFilePath)) {
    fileSystem.delete(mFilePath);
  }
  FileOutStream os = fileSystem.createFile(mFilePath, mWriteOptions);
  os.write(buf.array());
  os.close();
  System.out.println((FormatUtils.formatTimeTakenMs(startTimeMs, "writeFile to file " + mFilePath)));
}

private boolean readFile(FileSystem fileSystem)
    throws IOException, AlluxioException {
  boolean pass = true;
  LOG.debug("Reading data...");
  final long startTimeMs = CommonUtils.getCurrentMs();
  FileInStream is = fileSystem.openFile(mFilePath, mReadOptions);
  ByteBuffer buf = ByteBuffer.allocate((int) is.remaining());
  is.read(buf.array());
  buf.order(ByteOrder.nativeOrder());
  for (int k = 0; k < NUMBERS; k++) {
    pass = pass && (buf.getInt() == k);
    System.out.print(pass);
  }
  is.close();
  System.out.println(FormatUtils.formatTimeTakenMs(startTimeMs, "readFile file " + mFilePath));
  return pass;
}
```

### hadoop
```java
public void writeFileWithAbstractFileSystem() {
  org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration();
  conf.set("alluxio.master.hostname", "10.2.4.192");
  conf.set("alluxio.master.port", "19998");
  conf.set("alluxio.security.login.username", "gl");
  conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
  conf.set("alluxio.user.file.writetype.default", "CACHE_THROUGH");
  try {
    Path p = new Path("alluxio://10.2.4.192:19998/tmp/test4");
    org.apache.hadoop.fs.FileSystem fileSystem = p.getFileSystem(conf);

    System.out.println(System.currentTimeMillis());
    fileSystem.mkdirs(p);
    System.out.println(System.currentTimeMillis());
  } catch (Exception e) {
    e.printStackTrace();
  }
}
```
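Reading back through the Hadoop abstraction is symmetric. A hedged companion sketch, reusing the conf built in the method above (the path is the one just created):
```java
// List the directory we just created through the Hadoop FileSystem API.
Path p = new Path("alluxio://10.2.4.192:19998/tmp/test4");
org.apache.hadoop.fs.FileSystem fileSystem = p.getFileSystem(conf);
for (org.apache.hadoop.fs.FileStatus status : fileSystem.listStatus(p)) {
  System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
}
```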
## Alluxio KV
Alluxio supports writing key-value data, but since Alluxio does not support appending to files, the KV store does not support appends either; keys must also be pre-sorted and unique. That makes Alluxio's KV store usable in only quite limited scenarios.

```java
// alluxio.hadoop.AbstractFileSystem
public FSDataOutputStream append(Path path, int bufferSize, Progressable progress)
    throws IOException {
  LOG.debug("append({}, {}, {})", path, bufferSize, progress);
  if (mStatistics != null) {
    mStatistics.incrementWriteOps(1);
  }
  AlluxioURI uri = new AlluxioURI(HadoopUtils.getPathWithoutScheme(path));
  try {
    if (mFileSystem.exists(uri)) {
      throw new IOException(ExceptionMessage.FILE_ALREADY_EXISTS.getMessage(uri));
    }
    return new FSDataOutputStream(mFileSystem.createFile(uri), mStatistics);
  } catch (AlluxioException e) {
    throw new IOException(e);
  }
}
```
Now a demo:
```java
public static void main(String[] args) throws Exception {
  if (args.length != 1) {
    System.out.println("You must give me the file path");
    System.exit(-1);
  }
  Configuration.set(PropertyKey.MASTER_HOSTNAME, "your hostname");
  AlluxioURI storeUri = new AlluxioURI(args[0]);
  KeyValueSystem kvs = KeyValueSystem.Factory.create();

  // Creates a store.
  KeyValueStoreWriter writer = kvs.createStore(storeUri);

  // Puts a key-value pair ("key", "value").
  String key = "key";
  String value = "value";
  writer.put(key.getBytes(), value.getBytes());
  System.out.println(String.format("(%s, %s) is put into the key-value store", key, value));

  // Completes the store.
  writer.close();

  // Opens a store.
  KeyValueStoreReader reader = kvs.openStore(storeUri);

  // Gets the value for "key".
  System.out.println(String.format("Value for key '%s' got from the store is '%s'", key,
      new String(reader.get(key.getBytes()))));

  // Closes the reader.
  reader.close();

  // The operations below throw a file-already-exists exception, because a
  // completed store cannot be created (i.e. appended to) a second time.
  writer = kvs.createStore(storeUri);
  writer.put("key1".getBytes(), "value1".getBytes());
}
```

## Alluxio Write Type
alluxio.user.file.writetype.default sets the Alluxio write type. There are five:
### MUST_CACHE
The file exists only in Alluxio's memory tier and is never persisted. This is the fastest write path, but durability cannot be guaranteed: Alluxio has no replication, so a worker crash loses the data.
### TRY_CACHE
Dropped in recent releases.
### CACHE_THROUGH
Writes go to Alluxio and synchronously to the under file system. Since the data is written both to Alluxio and to the UFS, this is necessarily slower than writing HDFS directly, so weigh the cost before choosing it.
### THROUGH
Skips Alluxio and writes straight to the under file system, same as writing HDFS directly.
### ASYNC_THROUGH
Writes go to Alluxio and asynchronously to the under file system; until the async persist completes, data loss is still possible.

## Alluxio Read Type
Reads have three strategies: NO_CACHE, CACHE, and CACHE_PROMOTE.

### NO_CACHE
With alluxio.user.file.readtype.default=NO_CACHE, reads bypass Alluxio memory and fetch straight from the under file system.

### CACHE
Loads data from the under file system into Alluxio storage, but with tiered storage the data does not move across tiers: it is read from whichever tier currently holds it.

### CACHE_PROMOTE
Reads the data and writes it into Alluxio's memory tier; if memory is short, other blocks are evicted to make room.
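Both defaults can also be overridden per operation instead of cluster-wide. A minimal sketch against the 1.x client API (option and enum names are as I recall them; verify before relying on this):
```java
import alluxio.AlluxioURI;
import alluxio.client.ReadType;
import alluxio.client.WriteType;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileOutStream;
import alluxio.client.file.FileSystem;
import alluxio.client.file.options.CreateFileOptions;
import alluxio.client.file.options.OpenFileOptions;

public class PerOperationTypesDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.Factory.get();
    // Write this one file through to the UFS, whatever the global default is.
    CreateFileOptions writeOptions =
        CreateFileOptions.defaults().setWriteType(WriteType.CACHE_THROUGH);
    try (FileOutStream out = fs.createFile(new AlluxioURI("/tmp/demo"), writeOptions)) {
      out.write("hello".getBytes());
    }
    // Read it back, promoting it into the memory tier.
    OpenFileOptions readOptions =
        OpenFileOptions.defaults().setReadType(ReadType.CACHE_PROMOTE);
    try (FileInStream in = fs.openFile(new AlluxioURI("/tmp/demo"), readOptions)) {
      System.out.println((char) in.read());
    }
  }
}
```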
## Alluxio FileWritePolicy
Alluxio has four block placement policies for file writes. A file written into Alluxio's distributed in-memory file system is stored as blocks, just as in HDFS, so which machine each block lands on can be decided in different ways.

### LocalFirstPolicy
Straight to the source: every time a file needs its next block, getWorkerForNextBlock is called to pick where that block goes, returning the host.
This policy prefers the local worker and falls back to another worker only when the local worker's CapacityBytes cannot fit the block.
```java
// alluxio.client.file.policy.LocalFirstPolicy
public WorkerNetAddress getWorkerForNextBlock(Iterable<BlockWorkerInfo> workerInfoList,
    long blockSizeBytes) {
  // first try the local host
  for (BlockWorkerInfo workerInfo : workerInfoList) {
    if (workerInfo.getNetAddress().getHost().equals(mLocalHostName)
        && workerInfo.getCapacityBytes() >= blockSizeBytes) {
      return workerInfo.getNetAddress();
    }
  }

  // otherwise randomly pick a worker that has enough availability
  List<BlockWorkerInfo> shuffledWorkers = Lists.newArrayList(workerInfoList);
  Collections.shuffle(shuffledWorkers);
  for (BlockWorkerInfo workerInfo : shuffledWorkers) {
    if (workerInfo.getCapacityBytes() >= blockSizeBytes) {
      return workerInfo.getNetAddress();
    }
  }
  return null;
}
```
Caveats:

- 1. The condition above checks CapacityBytes, not availableBytes, so as long as the block size is below CapacityBytes the local worker is returned no matter how much space it actually has left; when space runs out, blocks get evicted, using LRU.
- 2. UsedBytes in workerInfo is not updated in real time; it is only refreshed once a file finishes writing. I opened a [PR](https://github.com/Alluxio/alluxio/pull/4445) to address this, but since UsedBytes still is not real-time, you need to configure other parameters to reserve enough headroom for the file currently being written.

### MostAvailableFirstPolicy
Very simple code: for each block it picks the worker with the most remaining space, i.e. capacity minus used bytes.
```java
// alluxio.client.file.policy.MostAvailableFirstPolicy
public WorkerNetAddress getWorkerForNextBlock(Iterable<BlockWorkerInfo> workerInfoList,
    long blockSizeBytes) {
  long mostAvailableBytes = -1;
  WorkerNetAddress result = null;
  for (BlockWorkerInfo workerInfo : workerInfoList) {
    if (workerInfo.getCapacityBytes() - workerInfo.getUsedBytes() > mostAvailableBytes) {
      mostAvailableBytes = workerInfo.getCapacityBytes() - workerInfo.getUsedBytes();
      result = workerInfo.getNetAddress();
    }
  }
  return result;
}
```
Caveat:

- If the source data sits on a single node, this policy adds write latency, because blocks must cross the network, whereas LocalFirst only copies within memory and is correspondingly faster. So when a compute engine runs on top of Alluxio and its data is already spread evenly across nodes, LocalFirst beats the other policies.

### RoundRobinPolicy
```java
// alluxio.client.file.policy.RoundRobinPolicy
public WorkerNetAddress getWorkerForNextBlock(Iterable<BlockWorkerInfo> workerInfoList,
    long blockSizeBytes) {
  if (!mInitialized) {
    mWorkerInfoList = Lists.newArrayList(workerInfoList);
    Collections.shuffle(mWorkerInfoList);
    mIndex = 0;
    mInitialized = true;
  }

  // at most try all the workers
  for (int i = 0; i < mWorkerInfoList.size(); i++) {
    WorkerNetAddress candidate = mWorkerInfoList.get(mIndex).getNetAddress();
    BlockWorkerInfo workerInfo = findBlockWorkerInfo(workerInfoList, candidate);
    mIndex = (mIndex + 1) % mWorkerInfoList.size();
    if (workerInfo != null && workerInfo.getCapacityBytes() >= blockSizeBytes) {
      return candidate;
    }
  }
  return null;
}
```
Simple: each block goes to the next worker, in a shuffled rotation, that passes the capacity check.

### SpecificHostPolicy
```java
// alluxio.client.file.policy.SpecificHostPolicy
public WorkerNetAddress getWorkerForNextBlock(Iterable<BlockWorkerInfo> workerInfoList,
    long blockSizeBytes) {
  // find the first worker matching the host name
  for (BlockWorkerInfo info : workerInfoList) {
    if (info.getNetAddress().getHost().equals(mHostname)) {
      return info.getNetAddress();
    }
  }
  return null;
}
```
Every block lands on the one worker whose hostname matches; there is no fallback if that worker is absent.
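The default policy comes from alluxio.user.file.write.location.policy.class, but a policy can also be attached to an individual write. A hedged sketch against the 1.x client API (the worker hostname is a placeholder):
```java
import alluxio.AlluxioURI;
import alluxio.client.file.FileOutStream;
import alluxio.client.file.FileSystem;
import alluxio.client.file.options.CreateFileOptions;
import alluxio.client.file.policy.SpecificHostPolicy;

public class PinnedWriteDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.Factory.get();
    // Pin every block of this file to the worker running on "worker-01".
    CreateFileOptions options = CreateFileOptions.defaults()
        .setLocationPolicy(new SpecificHostPolicy("worker-01"));
    try (FileOutStream out = fs.createFile(new AlluxioURI("/pinned/demo"), options)) {
      out.write("pinned data".getBytes());
    }
  }
}
```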
## Summary

All in all, Alluxio is fairly straightforward to use; anyone comfortable with HDFS will pick it up quickly.

--------------------------------------------------------------------------------
/markdown/4-AlluxioBlockWrite.md:
--------------------------------------------------------------------------------

# Alluxio Block Storage
We know Alluxio is a memory-based distributed file system and that, like HDFS, it manages data as blocks. So how does Alluxio store data, what does it store it with, and how does it manage blocks?

Using the common file-write demo, we will trace how data gets packed into blocks step by step and then written to a worker. Master/worker communication, heartbeats, and how files are located through the master are left to later chapters.

## File Write Demo
First, a file-write demo, to get a feel for the write path from the API level.
```java
// user example code (cf. alluxio.examples.BasicOperations; find classes with Ctrl+N)
private void writeFile(FileSystem fileSystem)
    throws IOException, AlluxioException {
  ByteBuffer buf = ByteBuffer.allocate(NUMBERS * 4);
  buf.order(ByteOrder.nativeOrder());
  for (int k = 0; k < NUMBERS; k++) {
    buf.putInt(k);
  }
  LOG.debug("Writing data...");
  long startTimeMs = CommonUtils.getCurrentMs();
  if (fileSystem.exists(mFilePath)) {
    fileSystem.delete(mFilePath);
  }
  FileOutStream os = fileSystem.createFile(mFilePath, mWriteOptions);
  os.write(buf.array());
  os.close();
  System.out.println((FormatUtils.formatTimeTakenMs(startTimeMs, "writeFile to file " + mFilePath)));
}
```
Once the FileSystem is constructed, it can perform the actual operations. The createFile API first has the client send a createFile request to the master; on receipt the master inserts an InodeFile into mInodes, marking the file as created, and any failure surfaces as the corresponding exception. Let's look at the source in detail.
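Before diving in, a sketch of the call chain this chapter walks through (my own summary of the code below, not a figure from the project):
```
FileSystem.createFile ──thrift RPC──▶ master: insert InodeFile (metadata only)
        │
        ▼
FileOutStream.write ──▶ Local/RemoteBlockOutStream (one per block)
        │
        ▼  (remote case)
NettyRemoteBlockWriter ──netty──▶ worker: BlockDataServerHandler ──▶ LocalFileBlockWriter
```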
## BaseFileSystem
First, FileSystem's createFile API. FileSystem is an interface with a single implementation, BaseFileSystem. LineageFileSystem extends BaseFileSystem, but lineage is still experimental and I advise against enabling it ([details here](http://www.alluxio.org/docs/master/en/Lineage-API.html)), so today the Alluxio client always operates through BaseFileSystem.
```java
//alluxio.client.file.BaseFileSystem (or find the class with Ctrl+N)
public FileOutStream createFile(AlluxioURI path, CreateFileOptions options)
    throws FileAlreadyExistsException, InvalidPathException, IOException, AlluxioException {
  FileSystemMasterClient masterClient = mFileSystemContext.acquireMasterClient();
  try {
    masterClient.createFile(path, options);
    LOG.debug("Created file " + path.getPath());
  } finally {
    mFileSystemContext.releaseMasterClient(masterClient);
  }
  return new FileOutStream(path, options.toOutStreamOptions());
}
```
As the API shows, createFile first sends the createFile request to the master through masterClient. Only metadata is written at this point; no data moves. Only after the metadata write succeeds is a FileOutStream constructed to write the real data. If you want the details of the createFile RPC, see chapter 3 on RPC, which traces the complete client-to-master RPC path using the mkdir command; we skip it here.

## FileOutStream
With the path and the OutStreamOptions, a FileOutStream is constructed and its write API can truly write data.

First, what options OutStreamOptions carries:
```java
//alluxio.client.file.options.OutStreamOptions (or find the class with Ctrl+N)
public final class OutStreamOptions {
  private long mBlockSizeBytes;
  private long mTtl;
  private TtlAction mTtlAction;
  private FileWriteLocationPolicy mLocationPolicy;
  private WriteType mWriteType;
  private Permission mPermission;

  /**
   * @return the default {@link OutStreamOptions}
   */
  public static OutStreamOptions defaults() {
    return new OutStreamOptions();
  }

  private OutStreamOptions() {
    mBlockSizeBytes = Configuration.getBytes(PropertyKey.USER_BLOCK_SIZE_BYTES_DEFAULT);
    mTtl = Constants.NO_TTL;
    mTtlAction = TtlAction.DELETE;

    try {
      mLocationPolicy = CommonUtils.createNewClassInstance(
          Configuration.<FileWriteLocationPolicy>getClass(
              PropertyKey.USER_FILE_WRITE_LOCATION_POLICY), new Class[] {}, new Object[] {});
    } catch (Exception e) {
      throw Throwables.propagate(e);
    }
    mWriteType = Configuration.getEnum(PropertyKey.USER_FILE_WRITE_TYPE_DEFAULT, WriteType.class);
    mPermission = Permission.defaults();
    try {
      // Set user and group from user login module, and apply default file UMask.
      mPermission.applyFileUMask().setOwnerFromLoginModule();
    } catch (IOException e) {
      // Fall through to system property approach
    }
  }
}
```
The options comprise mBlockSizeBytes, mTtl, mTtlAction, mLocationPolicy, mWriteType, and mPermission:

- blockSizeBytes: the block size, 512MB by default, a client-side setting configurable via alluxio.user.block.size.bytes.default.
- ttl: the lifetime of a file or directory; once exceeded, a periodic task started by the Alluxio master daemon cleans it up. The default is -1, meaning keep forever.
- ttlAction: the action triggered when the TTL expires. There are two: delete removes the file or directory from Alluxio and from the under file system, while free removes only the data from Alluxio, keeping the metadata and the UFS copy. TTLs can be set via alluxio fs or in the create-file options.
- locationPolicy: the block placement policy for writes, LocalFirstPolicy by default, configurable via alluxio.user.file.write.location.policy.class.
- writeType: the write type (several variants; see chapter 1), configurable via alluxio.user.file.writetype.default.
- permission: the file permissions, controlled via alluxio.security.authorization.permission.umask; the default is 777.

OK, back to the main thread: how the output stream writes data.
```java
// alluxio.client.file.FileOutStream (or find the class with Ctrl+N)
@Override
public void write(byte[] b) throws IOException {
  Preconditions.checkArgument(b != null, PreconditionMessage.ERR_WRITE_BUFFER_NULL);
  write(b, 0, b.length);
}

@Override
public void write(byte[] b, int off, int len) throws IOException {
  Preconditions.checkArgument(b != null, PreconditionMessage.ERR_WRITE_BUFFER_NULL);
  Preconditions.checkArgument(off >= 0 && len >= 0 && len + off <= b.length,
      PreconditionMessage.ERR_BUFFER_STATE.toString(), b.length, off, len);

  if (mShouldCacheCurrentBlock) {
    try {
      int tLen = len;
      int tOff = off;
      while (tLen > 0) {
        if (mCurrentBlockOutStream == null || mCurrentBlockOutStream.remaining() == 0) {
          getNextBlock();
        }
        long currentBlockLeftBytes = mCurrentBlockOutStream.remaining();
        if (currentBlockLeftBytes >= tLen) {
          mCurrentBlockOutStream.write(b, tOff, tLen);
          tLen = 0;
        } else {
          mCurrentBlockOutStream.write(b, tOff, (int) currentBlockLeftBytes);
          tOff += currentBlockLeftBytes;
          tLen -= currentBlockLeftBytes;
        }
      }
    } catch (IOException e) {
      handleCacheWriteException(e);
    }
  }

  if (mUnderStorageType.isSyncPersist()) {
    mUnderStorageOutputStream.write(b, off, len);
    Metrics.BYTES_WRITTEN_UFS.inc(len);
  }
  mBytesWritten += len;
}
```
So when a byte[] is written, after a few precondition checks, getNextBlock() is called whenever the current block is full or absent, and mCurrentBlockOutStream then actually ships the data to the right worker; a write that straddles a block boundary is split across the old and new block streams.

```java
// alluxio.client.file.FileOutStream (or find the class with Ctrl+N)
private void getNextBlock() throws IOException {
  if (mCurrentBlockOutStream != null) {
    Preconditions.checkState(mCurrentBlockOutStream.remaining() <= 0,
        PreconditionMessage.ERR_BLOCK_REMAINING);
    mPreviousBlockOutStreams.add(mCurrentBlockOutStream);
  }

  if (mAlluxioStorageType.isStore()) {
    mCurrentBlockOutStream =
        mContext.getAlluxioBlockStore().getOutStream(getNextBlockId(), mBlockSize, mOptions);
    mShouldCacheCurrentBlock = true;
  }
}
```
Look: mCurrentBlockOutStream is obtained from mContext by first getting the AlluxioBlockStore and then asking it for an out-stream.
```java
//alluxio.client.block.AlluxioBlockStore (or find the class with Ctrl+N)
public BufferedBlockOutStream getOutStream(long blockId, long blockSize, OutStreamOptions options)
    throws IOException {
  WorkerNetAddress address;
  FileWriteLocationPolicy locationPolicy = Preconditions.checkNotNull(options.getLocationPolicy(),
      PreconditionMessage.FILE_WRITE_LOCATION_POLICY_UNSPECIFIED);
  try {
    address = locationPolicy.getWorkerForNextBlock(getWorkerInfoList(), blockSize);
  } catch (AlluxioException e) {
    throw new IOException(e);
  }
  return getOutStream(blockId, blockSize, address);
}
```
Here locationPolicy, the placement policy the user configured, returns a worker address, and blockId, blockSize, and that address are passed to the next getOutStream to produce the concrete out-stream.
```java
public BufferedBlockOutStream getOutStream(long blockId, long blockSize, WorkerNetAddress address)
    throws IOException {
  if (blockSize == -1) {
    try (CloseableResource<BlockMasterClient> blockMasterClientResource =
        mContext.acquireMasterClientResource()) {
      blockSize = blockMasterClientResource.get().getBlockInfo(blockId).getLength();
    } catch (AlluxioException e) {
      throw new IOException(e);
    }
  }
  // No specified location to write to.
  if (address == null) {
    throw new RuntimeException(ExceptionMessage.NO_WORKER_AVAILABLE.getMessage());
  }
  // Location is local.
  if (mLocalHostName.equals(address.getHost())) {
    return new LocalBlockOutStream(blockId, blockSize, address, mContext);
  }
  // Location is specified and it is remote.
  return new RemoteBlockOutStream(blockId, blockSize, address, mContext);
}
```
Look: a BufferedBlockOutStream comes back. If the current node's hostname matches the chosen address, it is a LocalBlockOutStream; otherwise a RemoteBlockOutStream.

We now focus on RemoteBlockOutStream; since it extends BufferedBlockOutStream, the analysis covers BufferedBlockOutStream along the way.

## BufferedBlockOutStream
FileOutStream.write writing a byte[] into Alluxio ultimately lands in RemoteBlockOutStream's write, so start there:
```java
//alluxio.client.block.BufferedBlockOutStream
public void write(byte[] b, int off, int len) throws IOException {
  if (len == 0) {
    return;
  }

  // Write the non-empty buffer if the new write will overflow it.
  if (mBuffer.position() > 0 && mBuffer.position() + len > mBuffer.limit()) {
    flush();
  }

  // If this write is larger than half of buffer limit, then write it out directly
  // to the remote block. Before committing the new writes, need to make sure
  // all bytes in the buffer are written out first, to prevent out-of-order writes.
  // Otherwise, when the write is small, write the data to the buffer.
  if (len > mBuffer.limit() / 2) {
    if (mBuffer.position() > 0) {
      flush();
    }
    unBufferedWrite(b, off, len);
  } else {
    mBuffer.put(b, off, len);
  }

  mWrittenBytes += len;
}
```
- If the byte[] length is at most mBuffer.limit()/2, the bytes go straight into mBuffer, a ByteBuffer that defaults to 1MB and is configurable via alluxio.user.file.buffer.bytes.
- If it is larger, flush() fires (RemoteBlockOutStream overrides it), writeToRemoteBlock then writes the data into the target block, and mBuffer.clear() empties the buffer.

```java
private void writeToRemoteBlock(byte[] b, int off, int len) throws IOException {
  mRemoteWriter.write(b, off, len);
  mFlushedBytes += len;
  Metrics.BYTES_WRITTEN_REMOTE.inc(len);
}
```
The write ultimately goes through mRemoteWriter:
```java
public RemoteBlockOutStream(long blockId,
    long blockSize,
    WorkerNetAddress address,
    BlockStoreContext blockStoreContext) throws IOException {
  super(blockId, blockSize, blockStoreContext);
  mCloser = Closer.create();
  try {
    mRemoteWriter = mCloser.register(RemoteBlockWriter.Factory.create());
    mBlockWorkerClient = mCloser.register(mContext.createWorkerClient(address));

    mRemoteWriter.open(mBlockWorkerClient.getDataServerAddress(), mBlockId,
        mBlockWorkerClient.getSessionId());
  } catch (IOException e) {
    mCloser.close();
    throw e;
  }
}
```
The address picked earlier by the location policy is handed to the context to construct a worker client, and the remote writer opens against that client. mRemoteWriter defaults to NettyRemoteBlockWriter, configurable via alluxio.user.block.remote.writer.class.

Let's see how NettyRemoteBlockWriter writes the data.

## NettyRemoteBlockWriter
```java
public final class NettyRemoteBlockWriter implements RemoteBlockWriter {
  private static final Logger LOG = LoggerFactory.getLogger(Constants.LOGGER_TYPE);

  private final Callable<Bootstrap> mClientBootstrap;

  private boolean mOpen;
  private InetSocketAddress mAddress;
  private long mBlockId;
  private long mSessionId;

  // Total number of bytes written to the remote block.
  private long mWrittenBytes;

  /**
   * Creates a new {@link NettyRemoteBlockWriter}.
306 | */ 307 | public NettyRemoteBlockWriter() { 308 | mClientBootstrap = NettyClient.bootstrapBuilder(); 309 | mOpen = false; 310 | } 311 | 312 | @Override 313 | public void open(InetSocketAddress address, long blockId, long sessionId) throws IOException { 314 | if (mOpen) { 315 | throw new IOException( 316 | ExceptionMessage.WRITER_ALREADY_OPEN.getMessage(mAddress, mBlockId, mSessionId)); 317 | } 318 | mAddress = address; 319 | mBlockId = blockId; 320 | mSessionId = sessionId; 321 | mWrittenBytes = 0; 322 | mOpen = true; 323 | } 324 | 325 | @Override 326 | public void close() { 327 | if (mOpen) { 328 | mOpen = false; 329 | } 330 | } 331 | 332 | @Override 333 | public void write(byte[] bytes, int offset, int length) throws IOException { 334 | SingleResponseListener listener = null; 335 | Channel channel = null; 336 | Metrics.NETTY_BLOCK_WRITE_OPS.inc(); 337 | try { 338 | channel = BlockStoreContext.acquireNettyChannel(mAddress, mClientBootstrap); 339 | listener = new SingleResponseListener(); 340 | channel.pipeline().get(ClientHandler.class).addListener(listener); 341 | ChannelFuture channelFuture = channel.writeAndFlush( 342 | new RPCBlockWriteRequest(mSessionId, mBlockId, mWrittenBytes, length, 343 | new DataByteArrayChannel(bytes, offset, length))).sync(); 344 | if (channelFuture.isDone() && !channelFuture.isSuccess()) { 345 | LOG.error("Failed to write to %s for block %d with error %s.", mAddress.toString(), 346 | mBlockId, channelFuture.cause()); 347 | throw new IOException(channelFuture.cause()); 348 | } 349 | 350 | RPCResponse response = listener.get(NettyClient.TIMEOUT_MS, TimeUnit.MILLISECONDS); 351 | 352 | switch (response.getType()) { 353 | case RPC_BLOCK_WRITE_RESPONSE: 354 | RPCBlockWriteResponse resp = (RPCBlockWriteResponse) response; 355 | RPCResponse.Status status = resp.getStatus(); 356 | LOG.debug("status: {} from remote machine {} received", status, mAddress); 357 | 358 | if (status != RPCResponse.Status.SUCCESS) { 359 | throw new IOException(ExceptionMessage.BLOCK_WRITE_ERROR.getMessage(mBlockId, 360 | mSessionId, mAddress, status.getMessage())); 361 | } 362 | mWrittenBytes += length; 363 | break; 364 | case RPC_ERROR_RESPONSE: 365 | RPCErrorResponse error = (RPCErrorResponse) response; 366 | throw new IOException(error.getStatus().getMessage()); 367 | default: 368 | throw new IOException(ExceptionMessage.UNEXPECTED_RPC_RESPONSE 369 | .getMessage(response.getType(), RPCMessage.Type.RPC_BLOCK_WRITE_RESPONSE)); 370 | } 371 | } catch (Exception e) { 372 | Metrics.NETTY_BLOCK_WRITE_FAILURES.inc(); 373 | try { 374 | // TODO(peis): We should not close the channel unless it is an exception caused by network. 
        if (channel != null) {
          channel.close().sync();
        }
      } catch (InterruptedException ee) {
        Throwables.propagate(ee);
      }
      throw new IOException(e);
    } finally {
      if (channel != null && listener != null && channel.isActive()) {
        channel.pipeline().get(ClientHandler.class).removeListener(listener);
      }
      if (channel != null) {
        BlockStoreContext.releaseNettyChannel(mAddress, channel);
      }
    }
  }
}
```
The data is written by this call:
```java
ChannelFuture channelFuture = channel.writeAndFlush(
    new RPCBlockWriteRequest(mSessionId, mBlockId, mWrittenBytes, length,
        new DataByteArrayChannel(bytes, offset, length))).sync();
```

### Worker BlockWriter
As above, the client asks the worker to write data over a Netty channel, so the worker must have a matching service handler: the BlockDataServerHandler class, in the worker package.

```java
// alluxio.worker.netty.BlockDataServerHandler
void handleBlockWriteRequest(final ChannelHandlerContext ctx, final RPCBlockWriteRequest req)
    throws IOException {
  final long sessionId = req.getSessionId();
  final long blockId = req.getBlockId();
  final long offset = req.getOffset();
  final long length = req.getLength();
  final DataBuffer data = req.getPayloadDataBuffer();

  BlockWriter writer = null;
  try {
    req.validate();
    ByteBuffer buffer = data.getReadOnlyByteBuffer();

    if (offset == 0) {
      // This is the first write to the block, so create the temp block file. The file will only
      // be created if the first write starts at offset 0. This allocates enough space for the
      // write.
      mWorker.createBlockRemote(sessionId, blockId, mStorageTierAssoc.getAlias(0), length);
    } else {
      // Allocate enough space in the existing temporary block for the write.
      mWorker.requestSpace(sessionId, blockId, length);
    }
    writer = mWorker.getTempBlockWriterRemote(sessionId, blockId);
    writer.append(buffer);

    Metrics.BYTES_WRITTEN_REMOTE.inc(data.getLength());
    RPCBlockWriteResponse resp =
        new RPCBlockWriteResponse(sessionId, blockId, offset, length, RPCResponse.Status.SUCCESS);
    ChannelFuture future = ctx.writeAndFlush(resp);
    future.addListener(new ClosableResourceChannelListener(writer));
  } catch (Exception e) {
    LOG.error("Error writing remote block : {}", e.getMessage(), e);
    RPCBlockWriteResponse resp =
        RPCBlockWriteResponse.createErrorResponse(req, RPCResponse.Status.WRITE_ERROR);
    ChannelFuture future = ctx.writeAndFlush(resp);
    future.addListener(ChannelFutureListener.CLOSE);
    if (writer != null) {
      writer.close();
    }
  }
}
```
Next, DefaultBlockWorker's getTempBlockWriterRemote returns the BlockWriter:
```java
//alluxio.worker.block.DefaultBlockWorker
public BlockWriter getTempBlockWriterRemote(long sessionId, long blockId)
    throws BlockDoesNotExistException, IOException {
  return mBlockStore.getBlockWriter(sessionId, blockId);
}
```
mBlockStore is a TieredBlockStore, the block store manager for tiered storage; it returns a LocalFileBlockWriter:
```java
//alluxio.worker.block.TieredBlockStore
public BlockWriter getBlockWriter(long sessionId, long blockId)
    throws BlockDoesNotExistException, IOException {
  // NOTE: a temp block is supposed to only be visible by its own writer, unnecessary to acquire
  // block lock here since no sharing
  // TODO(bin): Handle the case where multiple writers compete for the same block.
  try (LockResource r = new LockResource(mMetadataReadLock)) {
    TempBlockMeta tempBlockMeta = mMetaManager.getTempBlockMeta(blockId);
    return new LocalFileBlockWriter(tempBlockMeta.getPath());
  }
}
```

LocalFileBlockWriter's write method finally stores the data:

```java
//alluxio.worker.block.io.LocalFileBlockWriter
private long write(long offset, ByteBuffer inputBuf) throws IOException {
  int inputBufLength = inputBuf.limit() - inputBuf.position();
  MappedByteBuffer outputBuf =
      mLocalFileChannel.map(FileChannel.MapMode.READ_WRITE, offset, inputBufLength);
  outputBuf.put(inputBuf);
  int bytesWritten = outputBuf.limit();
  BufferUtils.cleanDirectBuffer(outputBuf);
  return bytesWritten;
}
```
LocalFileBlockWriter maps a region sized to inputBuf and copies the bytes into it. mLocalFileChannel.map calls the JDK's FileChannelImpl.map, which returns a MappedByteBuffer, i.e. a DirectByteBuffer; a DirectByteBuffer allocates system memory directly through Java's Unsafe, which is to say off-heap memory.
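To make that mechanism concrete, a standalone sketch, independent of Alluxio (the path and payload are arbitrary), of writing through FileChannel.map the way LocalFileBlockWriter does:
```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapWriteDemo {
  public static void main(String[] args) throws Exception {
    byte[] data = "hello block".getBytes();
    try (RandomAccessFile file = new RandomAccessFile("/tmp/block-demo", "rw");
         FileChannel channel = file.getChannel()) {
      // Map a region exactly as large as the payload, starting at offset 0.
      MappedByteBuffer out = channel.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
      out.put(data); // bytes land in the OS page cache, off the Java heap
      out.force();   // flush the mapped region to the file
    }
  }
}
```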
--------------------------------------------------------------------------------
/markdown/3-AlluxioRPC.md:
--------------------------------------------------------------------------------

# Alluxio RPC
Two parts: **a short introduction to Thrift with a worked example, and an introduction to Alluxio's RPC, using an example to show how to contribute source to Alluxio.**

## 1. Thrift
Alluxio uses the Thrift framework for RPC across the whole cluster, so we first introduce Thrift briefly and then write a small Thrift RPC demo.

### Thrift overview
Apache Thrift is a cross-language RPC framework Facebook developed in 2007. It compiles interfaces to many languages and offers several server threading models. You describe the interface functions and data types in Thrift's IDL (interface definition language), generate interface files for each target language with the Thrift compiler, and then develop the client and server code in whichever languages you need. Thrift's biggest draws are RPC plus multi-language support; it is somewhat like protobuf, but supports more languages and is simpler to use.

### Installing Thrift
```shell
tar -xvf thrift-0.9.3.tar.gz
cd thrift-0.9.3
./configure
make
make install
```
Verify the installation with thrift -version.
### Thrift demo
First an example, to get a direct feel for what Thrift offers: a simple client that sends arithmetic requests to a server and receives the results.

- **1. Define a thrift file**

```
namespace java com.di.thrift // the thrift command will generate the class com.di.thrift.ArithmeticService

typedef i64 long
typedef i32 int
service ArithmeticService {
  long add(1:int num1, 2:int num2), // method one, returns the sum
  long multiply(1:int num1, 2:int num2), // method two, returns the product
}
```
- **2. Generate the Java code with thrift --gen java arithmetic.thrift**

The generated code contains the Client and Server plumbing plus the Iface interface, which wraps the service methods and must be implemented by the user.

- **3. Create an ArithmeticServiceImpl class implementing ArithmeticService.Iface:**

```java
public class ArithmeticServiceImpl implements ArithmeticService.Iface {
  public long add(int num1, int num2) throws TException {
    return num1 + num2;
  }

  public long multiply(int num1, int num2) throws TException {
    return num1 * num2;
  }
}
```
- **4. Create a client class**

```
public class NonblockingClient {

  private void invoke() {
    TTransport transport;
    try {
      transport = new TFramedTransport(new TSocket("localhost", 7911));
      TProtocol protocol = new TBinaryProtocol(transport);

      ArithmeticService.Client client = new ArithmeticService.Client(protocol);
      transport.open();

      long addResult = client.add(100, 200);
      System.out.println("Add result: " + addResult);
      long multiplyResult = client.multiply(20, 40);
      System.out.println("Multiply result: " + multiplyResult);

      transport.close();
    } catch (TTransportException e) {
      e.printStackTrace();
    } catch (TException e) {
      e.printStackTrace();
    }
  }

  public static void main(String[] args) {
    NonblockingClient c = new NonblockingClient();
    c.invoke();
  }

}
```
- **5. Create a server class**

```
public class NonblockingServer {

  private void start() {
    try {
      TNonblockingServerTransport serverTransport = new TNonblockingServerSocket(7911);
      ArithmeticService.Processor processor = new ArithmeticService.Processor(new ArithmeticServiceImpl());

      TServer server = new TNonblockingServer(new TNonblockingServer.Args(serverTransport).
          processor(processor));
      System.out.println("Starting server on port 7911 ...");
      server.serve();
    } catch (TTransportException e) {
      e.printStackTrace();
    }
  }

  public static void main(String[] args) {
    NonblockingServer srv = new NonblockingServer();
    srv.start();
  }
}
```

OK — start the server, then the client, and check that the client receives the computed results. To try it across machines, package the client and server as separate jars, put them on different nodes, and change the socket hostname to the server machine's hostname or IP.

### The Thrift stack
Having taken Thrift RPC for a spin, let's walk through the framework in more detail; afterwards we will analyze Alluxio's client and server code with it.

Thrift ships a complete stack for building clients and servers. The figure below depicts Thrift's overall architecture.

![architecture](../Graphs/thrift-architecture.png)
The figure shows Thrift's protocol stack. Thrift is a client/server (C/S) architecture. The top layer is the business logic you implement yourself. The second layer is code generated by the Thrift compiler, responsible for parsing, sending, and receiving structured data. TServer's main job is to accept client requests efficiently and hand them to a Processor. The Processor produces the response: it dispatches the RPC, parses the call arguments, invokes the user logic, and writes the return value back. Everything from TProtocol down is Thrift's wire protocol and low-level I/O. TProtocol encodes data types, turning structured data into a byte stream for TTransport. TTransport is the transport layer tied to the underlying I/O; it sends and receives message bodies as byte streams without caring about the data types. The bottom I/O layer does the actual transfer: sockets, files, compressed streams, and so on.

**1. Data types**
A Thrift script can define the following kinds of types:

- Base types:
```
bool: boolean, true or false; Java boolean
byte: 8-bit signed integer; Java byte
i16: 16-bit signed integer; Java short
i32: 32-bit signed integer; Java int
i64: 64-bit signed integer; Java long
double: 64-bit float; Java double
string: text of unknown encoding, or a binary string; Java String
```
- Struct type:
```
struct: defines a plain object, like a struct in C; a JavaBean in Java
```
- Container types:
```
list: Java ArrayList
set: Java HashSet
map: Java HashMap
```
- Exception type:
```
exception: Java Exception
```
- Service type:
```
service: the class of the service
```
In the example above we used a few base types plus a service; Thrift generates different code for each kind of declaration.

**2. Protocol layer**

Thrift lets you choose the wire protocol between client and server, broadly split into text and binary protocols. To save bandwidth and transfer efficiently, binary protocols are the common choice; text protocols are sometimes used, depending on the actual requirements of the project or product. The common ones:

- TBinaryProtocol: binary encoding, used as in the demo.
- TJSONProtocol: JSON encoding.
```
//client
TJSONProtocol protocol = new TJSONProtocol(transport);
//server
TJSONProtocol.Factory proFactory = new TJSONProtocol.Factory();
```
- TSimpleJSONProtocol: a write-only JSON protocol, suited to parsing from scripting languages.
- TCompactProtocol: an efficient, dense binary encoding.
```java
// usage: on the client, replace TBinaryProtocol
TCompactProtocol protocol = new TCompactProtocol(transport);
// on the server
TCompactProtocol.Factory proFactory = new TCompactProtocol.Factory();
// and pass the factory in via args.protocolFactory(proFactory)
```

**3. Transport layer**

Common transports:

- TSocket: blocking I/O, the simplest mode
- TFramedTransport: non-blocking, transfers in frames of a given size, similar to Java NIO; see the demo
- TNonblockingTransport: non-blocking, for building asynchronous clients
- TFileTransport: transfers via files, as the name says; no Java implementation ships, but writing one is straightforward
- TZlibTransport: zlib compression; no Java implementation

**4. Server types**

The common server types are listed below; a thread-pool variant of the demo server follows the list.

- TSimpleServer: single-threaded server using blocking I/O
- TThreadPoolServer: multi-threaded server using blocking I/O
- TNonblockingServer: multi-threaded server using non-blocking I/O
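For instance, a minimal sketch of serving the same ArithmeticService through TThreadPoolServer, for comparison with the TNonblockingServer demo above (note a blocking server expects plain TSocket clients, without the framed transport):
```java
TServerTransport serverTransport = new TServerSocket(7911);
ArithmeticService.Processor processor =
    new ArithmeticService.Processor(new ArithmeticServiceImpl());
// One worker thread per accepted connection, blocking I/O throughout.
TServer server = new TThreadPoolServer(
    new TThreadPoolServer.Args(serverTransport).processor(processor));
server.serve();
```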
### Secure non-blocking RPC
First generate a keystore file with keytool, e.g.:
```
keytool -genkeypair -alias certificatekey -keyalg RSA -validity 36500 -keystore .keystore
```

**client**
```
public class SecureClient {
  private void invoke() {
    TTransport transport;
    try {
      TSSLTransportFactory.TSSLTransportParameters params =
          new TSSLTransportFactory.TSSLTransportParameters();
      params.setTrustStore("tests\\src\\resouces\\truststore.jks", "111111");
      transport = TSSLTransportFactory.getClientSocket("localhost", 7911, 10000, params);
      TProtocol protocol = new TBinaryProtocol(transport);
      ArithmeticService.Client client = new ArithmeticService.Client(protocol);
      long addResult = client.add(100, 200);
      System.out.println("Add result: " + addResult);
      long multiplyResult = client.multiply(20, 40);
      System.out.println("Multiply result: " + multiplyResult);
      transport.close();
    } catch (TTransportException e) {
      e.printStackTrace();
    } catch (TException e) {
      e.printStackTrace();
    }
  }
  public static void main(String[] args) {
    SecureClient c = new SecureClient();
    c.invoke();
  }
}
```
**server**
```
public class SecureServer {
  private void start() {
    try {
      TSSLTransportFactory.TSSLTransportParameters params =
          new TSSLTransportFactory.TSSLTransportParameters();
      params.setKeyStore("tests\\src\\resouces\\keystore.jks", "111111");
      TServerSocket serverTransport = TSSLTransportFactory.getServerSocket(
          7911, 10000, InetAddress.getByName("localhost"), params);
      ArithmeticService.Processor processor = new ArithmeticService.Processor(new ArithmeticServiceImpl());

      TServer server = new TThreadPoolServer(new TThreadPoolServer.Args(serverTransport).
          processor(processor));
      System.out.println("Starting server on port 7911 ...");
      server.serve();
    } catch (TTransportException e) {
      e.printStackTrace();
    } catch (UnknownHostException e) {

    }
  }
  public static void main(String[] args) {
    SecureServer srv = new SecureServer();
    srv.start();
  }
}
```
For integrating Thrift with Kerberos, see chapter 11.

## 2. Alluxio RPC
Alluxio uses Thrift as its RPC framework. Let's analyze one of Alluxio's thrift files:
```
// core/common/src/thrift/file_system_master.thrift
namespace java alluxio.thrift

include "common.thrift"
include "exception.thrift"

struct CreateDirectoryTOptions {
  1: optional bool persisted
  2: optional bool recursive
  3: optional bool allowExists
  4: optional i16 mode
  5: optional i64 ttl
  6: optional common.TTtlAction ttlAction
}

struct CreateFileTOptions {
  1: optional i64 blockSizeBytes
  2: optional bool persisted
  3: optional bool recursive
  4: optional i64 ttl
  5: optional i16 mode
  6: optional common.TTtlAction ttlAction
}

struct FileSystemCommand {
  1: common.CommandType commandType
  2: FileSystemCommandOptions commandOptions
}

struct PersistCommandOptions {
  1: list<PersistFile> persistFiles
}

struct PersistFile {
  1: i64 fileId
  2: list<i64> blockIds
}

struct SetAttributeTOptions {
  1: optional bool pinned
  2: optional i64 ttl
  3: optional bool persisted
  4: optional string owner
  5: optional string group
  6: optional i16 mode
  7: optional bool recursive
  8: optional common.TTtlAction ttlAction
}

union FileSystemCommandOptions {
  1: optional PersistCommandOptions persistOptions
}

/**
 * This interface contains file system master service endpoints for Alluxio clients.
 */
service FileSystemMasterClientService extends common.AlluxioService {

  /**
   * Creates a directory.
   */
  void createDirectory(
    /** the path of the directory */ 1: string path,
    /** the method options */ 2: CreateDirectoryTOptions options,
    )
    throws (1: exception.AlluxioTException e, 2: exception.ThriftIOException ioe)

  /**
   * Creates a file.
   */
  void createFile(
    /** the path of the file */ 1: string path,
    /** the options for creating the file */ 2: CreateFileTOptions options,
    )
    throws (1: exception.AlluxioTException e, 2: exception.ThriftIOException ioe)

  /**
   * Frees the given file or directory from Alluxio.
   */
  void free(
    /** the path of the file or directory */ 1: string path,
    /** whether to free recursively */ 2: bool recursive,
    )
    throws (1: exception.AlluxioTException e)

  /**
   * Returns the UFS address of the root mount point.
   *
   * THIS METHOD IS DEPRECATED SINCE VERSION 1.1 AND WILL BE REMOVED IN VERSION 2.0.
   */
  string getUfsAddress() throws (1: exception.AlluxioTException e)

  /**
   * If the path points to a file, the method returns a singleton with its file information.
   * If the path points to a directory, the method returns a list with file information for the
   * directory contents.
   */
  list<FileInfo> listStatus(
    /** the path of the file or directory */ 1: string path,
    /** listStatus options */ 2: ListStatusTOptions options,
    )
    throws (1: exception.AlluxioTException e)

  /**
   * Creates a new "mount point", mounts the given UFS path in the Alluxio namespace at the given
   * path. The path should not exist and should not be nested under any existing mount point.
   */
  void mount(
    /** the path of alluxio mount point */ 1: string alluxioPath,
    /** the path of the under file system */ 2: string ufsPath,
    /** the options for creating the mount point */ 3: MountTOptions options,
    )
    throws (1: exception.AlluxioTException e, 2: exception.ThriftIOException ioe)

  /**
   * Deletes a file or a directory and returns whether the remove operation succeeded.
   * NOTE: Unfortunately, the method cannot be called "delete" as that is a reserved Thrift keyword.
   */
  void remove(
    /** the path of the file or directory */ 1: string path,
    /** whether to remove recursively */ 2: bool recursive,
    )
    throws (1: exception.AlluxioTException e)

  /**
   * Renames a file or a directory.
   */
  void rename(
    /** the path of the file or directory */ 1: string path,
    /** the destination path of the file */ 2: string dstPath,
    )
    throws (1: exception.AlluxioTException e, 2: exception.ThriftIOException ioe)

  /**
   * Sets file or directory attributes.
   */
  void setAttribute(
    /** the path of the file or directory */ 1: string path,
    /** the method options */ 2: SetAttributeTOptions options,
    )
    throws (1: exception.AlluxioTException e)

}
```
I trimmed part of this thrift file, otherwise it would run far too long. You can see clearly that the structs define the option arguments for the various file operations, while the service defines the file-operation endpoints, taking the structs as parameters.

- **One thing to note: if you modify a thrift file, you do not run the thrift command yourself to regenerate the Java code; Alluxio ships a script for it, and alluxio thriftGen regenerates the code for the modified thrift files.**
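To get a feel for the caller's side of the generated code, a hedged sketch (thrift generates a Java bean with a setter per optional field; client is assumed to be an already-connected FileSystemMasterClientService.Client):
```java
// Hypothetical caller of the code generated from the IDL above.
CreateDirectoryTOptions opts = new CreateDirectoryTOptions();
opts.setRecursive(true);        // create missing parents
opts.setTtl(24 * 3600 * 1000L); // expire the directory after one day
client.createDirectory("/tmp/demo", opts);
```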
409 | */ 410 | void setAttribute( 411 | /** the path of the file or directory */ 1: string path, 412 | /** the method options */ 2: SetAttributeTOptions options, 413 | ) 414 | throws (1: exception.AlluxioTException e) 415 | 416 | } 417 | 418 | ``` 419 | 对于上述thrift语言文件,我删除了一部分不然篇幅就显得太长了。可以清楚的看到,struct定义了很多文件操作相对于的操作参数。而service定义了很多文件操作接口,并将struct作为操作参数。 420 | 421 | - **这里有需要注意的地方,如果大家修改了thrift文件,不需要使用thrift命令来生产thrift java代码,alluxio提供了脚本来生产,alluxio thriftGen就可以自动生成修改过得thrift文件** 422 | 423 | 424 | 接下来通过client端创建一个文件夹为例来说明alluxio如何进行RPC通信的。 425 | ``` 426 | alluxio fs mkdir /your/alluxio/path 427 | ``` 428 | ### shell 429 | 找到MkdirCommand.java文件: /shell/src/main/java/shell/command/MkdirCommand.java 430 | 431 | ```java 432 | public void run(CommandLine cl) throws AlluxioException, IOException { 433 | String[] args = cl.getArgs(); 434 | for (String path : args) { 435 | AlluxioURI inputPath = new AlluxioURI(path); 436 | 437 | CreateDirectoryOptions options = CreateDirectoryOptions.defaults().setRecursive(true); 438 | mFileSystem.createDirectory(inputPath, options); 439 | System.out.println("Successfully created directory " + inputPath); 440 | } 441 | } 442 | ``` 443 | 代码很简单,首先new一个CreateDirectoryOptions的类。 444 | 然后调用client端的createDirectory方法。 445 | - 这边注意一点CreateDirectoryOptions是client端自己写的一个类,它和thrift文件中定义的类不一样CreateDirectoryTOptions,所以自己实现选项类的时候需要加一个toThrift方法,将自己的参数转换为thrift API能够接受的参数。 446 | 447 | 下面来看看client端的代码和server端的代码: 448 | ###client 449 | 找到BaseFileSystem 类中的createDirectory方法。 450 | ``` java 451 | @Override 452 | public void createDirectory(AlluxioURI path, CreateDirectoryOptions options) 453 | throws FileAlreadyExistsException, InvalidPathException, IOException, AlluxioException { 454 | FileSystemMasterClient masterClient = mFileSystemContext.acquireMasterClient(); 455 | try { 456 | masterClient.createDirectory(path, options); 457 | LOG.debug("Created directory " + path.getPath()); 458 | } finally { 459 | mFileSystemContext.releaseMasterClient(masterClient); 460 | } 461 | } 462 | ``` 463 | 接着调用FileSystemMasterClient的createDirectory 464 | ``` 465 | public synchronized void createDirectory(final AlluxioURI path, 466 | final CreateDirectoryOptions options) throws IOException, AlluxioException { 467 | retryRPC(new RpcCallableThrowsAlluxioTException() { 468 | @Override 469 | public Void call() throws AlluxioTException, TException { 470 | mClient.createDirectory(path.getPath(), options.toThrift()); 471 | return null; 472 | } 473 | }); 474 | } 475 | ``` 476 | 看到这里,才是真正意义上的调用thrift生成的client代码,它接收的参数options必须转换成带T的类。 477 | ``` 478 | mClient.createDirectory(path.getPath(), options.toThrift()); 479 | ``` 480 | mClient: FileSystemMasterClientService 类是我们定义的thrift文件生成的class,里面的Iface是必须要实现的。可以找到master端的实现代码。 481 | 482 | 483 | ###server 484 | FileSystemMasterClientServiceHandler类为IFace的实现代码,但是真正操作元数据的是FileSystemMaster这个类。 485 | 486 | ``` 487 | public void createDirectory(final String path, final CreateDirectoryTOptions options) 488 | throws AlluxioTException, ThriftIOException { 489 | RpcUtils.call(new RpcCallableThrowsIOException() { 490 | @Override 491 | public Void call() throws AlluxioException, IOException { 492 | mFileSystemMaster.createDirectory(new AlluxioURI(path), 493 | new CreateDirectoryOptions(options)); 494 | return null; 495 | } 496 | }); 497 | } 498 | ``` 499 | 如FileSystemMasterClientServiceHandler类中createDirectory方法,调用了FileSystemMaster的createDirectory方法 500 | 501 | ``` 502 | public void createDirectory(AlluxioURI path, CreateDirectoryOptions options) 503 | throws 
```java
public void createDirectory(AlluxioURI path, CreateDirectoryOptions options)
    throws InvalidPathException, FileAlreadyExistsException, IOException, AccessControlException,
    FileDoesNotExistException {
  LOG.debug("createDirectory {} ", path);
  Metrics.CREATE_DIRECTORIES_OPS.inc();
  long flushCounter = AsyncJournalWriter.INVALID_FLUSH_COUNTER;
  try (LockedInodePath inodePath = mInodeTree.lockInodePath(path, InodeTree.LockMode.WRITE)) {
    mPermissionChecker.checkParentPermission(Mode.Bits.WRITE, inodePath);
    mMountTable.checkUnderWritableMountPoint(path);
    flushCounter = createDirectoryAndJournal(inodePath, options);
  } finally {
    // finally runs after resources are closed (unlocked).
    waitForJournalFlush(flushCounter);
  }
}
```
As above, the path is first locked in WRITE mode and then permission-checked; without create permission this throws an AccessControlException straight away.

- **Alluxio's current permission checking is still weak: simply by setting the username, a user can become any other user and gain that user's privileges.**

Finally, the directory's metadata is added to the InodeTree. Let's look at the InodeTree data structure.

### InodeTree
Alluxio implements its own data structure for file metadata:
```java
public void initializeRoot(Permission permission) {
  if (mRoot == null) {
    mRoot = InodeDirectory
        .create(mDirectoryIdGenerator.getNewDirectoryId(), NO_PARENT, ROOT_INODE_NAME,
            CreateDirectoryOptions.defaults().setPermission(permission));
    mRoot.setPersistenceState(PersistenceState.PERSISTED);
    mInodes.add(mRoot);
    mCachedInode = mRoot;
  }
}
```
All inodes live in mInodes, whose type is:
```java
private final FieldIndex<Inode<?>> mInodes = new UniqueFieldIndex<>(ID_INDEX);
```
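ID_INDEX tells the index which field of each inode to key on. A sketch of what such a definition boils down to (this mirrors how I read the InodeTree source; treat the exact shape as an assumption):
```java
// A unique index keyed on the inode id.
private static final IndexDefinition<Inode<?>> ID_INDEX =
    new IndexDefinition<Inode<?>>(true /* unique */) {
      @Override
      public Object getFieldValue(Inode<?> o) {
        return o.getId();
      }
    };
```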
550 | * 551 | * @param indexDefinition definition of index 552 | */ 553 | public UniqueFieldIndex(IndexDefinition indexDefinition) { 554 | mIndexMap = new ConcurrentHashMapV8<>(8, 0.95f, 8); 555 | mIndexDefinition = indexDefinition; 556 | } 557 | 558 | @Override 559 | public boolean add(T object) { 560 | Object fieldValue = mIndexDefinition.getFieldValue(object); 561 | T previousObject = mIndexMap.putIfAbsent(fieldValue, object); 562 | 563 | if (previousObject != null && previousObject != object) { 564 | return false; 565 | } 566 | return true; 567 | } 568 | 569 | @Override 570 | public boolean remove(T object) { 571 | Object fieldValue = mIndexDefinition.getFieldValue(object); 572 | return mIndexMap.remove(fieldValue, object); 573 | } 574 | 575 | @Override 576 | public void clear() { 577 | mIndexMap.clear(); 578 | } 579 | 580 | @Override 581 | public boolean containsField(Object fieldValue) { 582 | return mIndexMap.containsKey(fieldValue); 583 | } 584 | 585 | @Override 586 | public boolean containsObject(T object) { 587 | Object fieldValue = mIndexDefinition.getFieldValue(object); 588 | T res = mIndexMap.get(fieldValue); 589 | if (res == null) { 590 | return false; 591 | } 592 | return res == object; 593 | } 594 | 595 | @Override 596 | public Set getByField(Object value) { 597 | T res = mIndexMap.get(value); 598 | if (res != null) { 599 | return Collections.singleton(res); 600 | } 601 | return Collections.emptySet(); 602 | } 603 | 604 | @Override 605 | public T getFirst(Object value) { 606 | return mIndexMap.get(value); 607 | } 608 | 609 | @Override 610 | public Iterator iterator() { 611 | return mIndexMap.values().iterator(); 612 | } 613 | 614 | @Override 615 | public int size() { 616 | return mIndexMap.size(); 617 | } 618 | } 619 | 620 | ``` 621 | 其中由ConcurrentHashMapV8 和IndexDefinition来维护元数据,ConcurrentHashMapV8是netty实现的java8 的concurrentHashMap。 IndexDefinition用来维护id的唯一性。 622 | 623 | add的时候,首先获取object的id,然后id作为key,put到hashmap中。同时在插入的过程中,会将path转换成Inode,Inode分为两种:InodeFile和InodeDirectory,InodeFile会有指向parent的引用,一样,InodeDirectory 624 | 会用child数组来记录自己的孩子节点。 625 | 626 | 627 | --------------------------------------------------------------------------------