Jump to content

Get to Know Hadoop Filesystems

0
  tomwhite's Photo
Posted Oct 28 2009 11:33 AM

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, which are described in Table 3.1

Table 3.1. Hadoop filesystems

FilesystemURI schemeJava implementation (all under org.apache.hadoop)Description
Localfilefs.LocalFileSystemA filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums.
HDFShdfshdfs.DistributedFileSystemHadoop’s distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.
HFTPhftphdfs.HftpFileSystemA filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp to copy data between HDFS clusters running different versions.
HSFTPhsftphdfs.HsftpFileSystemA filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.)
HARharfs.HarFileSystemA filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode’s memory usage.
KFS (CloudStore)kfsfs.kfs.KosmosFileSystemCloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google’s GFS, written in C++. Find more information about it at http://kosmosfs.sourceforge.net/.
FTPftpfs.ftp.FTPFileSystemA filesystem backed by an FTP server.
S3 (native)s3nfs.s3native.NativeS3FileSystemA filesystem backed by Amazon S3. See http://wiki.apache.org/hadoop/AmazonS3.
S3 (block-based)s3fs.s3.S3FileSystemA filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3’s 5 GB file size limit.

Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the correct filesystem instance to communicate with. For example, the filesystem shell that we met in the previous section operates with all Hadoop filesystems. To list the files in the root directory of the local filesystem, type:

% hadoop fs -ls file:///

Although it is possible (and sometimes very convenient) to run MapReduce programs that access any of these filesystems, when you are processing large volumes of data, you should choose a distributed filesystem that has the data locality optimization, such as HDFS or KFS.

Interfaces

Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through the Java API.[25] The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations. The other filesystem interfaces are discussed briefly in this section. These interfaces are most commonly used with HDFS, since the other filesystems in Hadoop typically have existing tools to access the underlying filesystem (FTP clients for FTP, S3 tools for S3, etc.), but many of them will work with any Hadoop filesystem.

Thrift

By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access Hadoop filesystems. The Thrift API in the “thriftfs” contrib module remedies this deficiency by exposing Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.

To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.

The Thrift API comes with a number of pregenerated stubs for a variety of languages, including C++, Perl, PHP, Python, and Ruby. Thrift has support for versioning, so it’s a good choice if you want to access different versions of a Hadoop filesystem from the same client code (you will need to run a proxy for each version of Hadoop to achieve this, however).

For installation and usage instructions, please refer to the documentation in the src/contrib/thriftfs directory of the Hadoop distribution.

C

Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was written as a C library for accessing HDFS, but despite its name it can be used to access any Hadoop filesystem). It works using the Java Native Interface (JNI) to call a Java filesystem client.

The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.

Hadoop comes with prebuilt libhdfs binaries for 32-bit Linux, but for other platforms, you will need to build them yourself using the instructions at http://wiki.apache.org/hadoop/LibHDFS.

FUSE

Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix utilities (such as ls and cat) to interact with the filesystem, as well as POSIX libraries to access the filesystem from any programming language.

Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.

WebDAV

WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as filesystems on most operating systems, so by exposing HDFS (or other Hadoop filesystems) over WebDAV, it’s possible to access HDFS as a standard filesystem.

At the time of this writing, WebDAV support in Hadoop (which is implemented by calling the Java API to Hadoop) is still under development, and can be tracked at https://issues.apach...owse/HADOOP-496.

Other HDFS Interfaces

There are two interfaces that are specific to HDFS:

HTTP

HDFS defines a read-only interface for retrieving directory listings and data over HTTP. Directory listings are served by the namenode’s embedded web server (which runs on port 50070) in XML format, while file data is streamed from datanodes by their web servers (running on port 50075). This protocol is not tied to a specific HDFS version, making it possible to write clients that can use HTTP to read data from HDFS clusters that run different versions of Hadoop. HftpFileSystem is a such a client: it is a Hadoop filesystem that talks to HDFS over HTTP (HsftpFileSystem is the HTTPS variant).

FTP

Although not complete at the time of this writing (https://issues.apach...wse/HADOOP-3199), there is an FTP interface to HDFS, which permits the use of the FTP protocol to interact with HDFS. This interface is a convenient way to transfer data into and out of HDFS using existing FTP clients.

The FTP interface to HDFS is not to be confused with FTPFileSystem, which exposes any FTP server as a Hadoop filesystem.

[25] The RPC interfaces in Hadoop are based on Hadoop’s Writable interface, which is Java-centric. In the future, Hadoop will adopt another, cross-language, RPC serialization format, which will allow native HDFS clients to be written in languages other than Java.

Cover of Hadoop: The Definitive Guide
Learn more about this topic from Hadoop: The Definitive Guide. 

Apache Hadoop is ideal for organizations with a growing need to process massive application datasets. Hadoop: The Definitive Guide is a comprehensive resource for using Hadoop to build reliable, scalable, distributed systems. Programmers will find details for analyzing large datasets with Hadoop, and administrators will learn how to set up and run Hadoop clusters. The book includes case studies that illustrate how Hadoop is used to solve specific problems.

Learn More Read Now on Safari


Tags:
0 Subscribe


0 Replies