Hadoop has an abstract notion of filesystem, of which HDFS is just
one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a
filesystem in Hadoop, and there are several concrete implementations,
which are described in Table 3.1
Table 3.1. Hadoop filesystems
| Filesystem | URI scheme | Java implementation (all under org.apache.hadoop) | Description |
|---|---|---|---|
| Local | file | fs.LocalFileSystem | A filesystem for a locally connected disk with
client-side checksums. Use RawLocalFileSystem for a local
filesystem with no checksums. |
| HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop’s distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce. |
| HFTP | hftp | hdfs.HftpFileSystem | A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp to copy data between HDFS clusters running different versions. |
| HSFTP | hsftp | hdfs.HsftpFileSystem | A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.) |
| HAR | har | fs.HarFileSystem | A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode’s memory usage. |
| KFS (CloudStore) | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google’s GFS, written in C++. Find more information about it at http://kosmosfs.sourceforge.net/. |
| FTP | ftp | fs.ftp.FTPFileSystem | A filesystem backed by an FTP server. |
| S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A filesystem backed by Amazon S3. See http://wiki.apache.org/hadoop/AmazonS3. |
| S3 (block-based) | s3 | fs.s3.S3FileSystem | A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3’s 5 GB file size limit. |
Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the correct filesystem instance to communicate with. For example, the filesystem shell that we met in the previous section operates with all Hadoop filesystems. To list the files in the root directory of the local filesystem, type:
%hadoop fs -ls file:///
Although it is possible (and sometimes very convenient) to run MapReduce programs that access any of these filesystems, when you are processing large volumes of data, you should choose a distributed filesystem that has the data locality optimization, such as HDFS or KFS.
Hadoop is written in Java, and all Hadoop filesystem
interactions are mediated through the Java API.[25] The filesystem shell, for example, is a Java application
that uses the Java FileSystem class
to provide filesystem operations. The other filesystem interfaces are
discussed briefly in this section. These interfaces are most commonly
used with HDFS, since the other filesystems in Hadoop typically have
existing tools to access the underlying filesystem (FTP clients for
FTP, S3 tools for S3, etc.), but many of them will work with any
Hadoop filesystem.
By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access Hadoop filesystems. The Thrift API in the “thriftfs” contrib module remedies this deficiency by exposing Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.
To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.
The Thrift API comes with a number of pregenerated stubs for a variety of languages, including C++, Perl, PHP, Python, and Ruby. Thrift has support for versioning, so it’s a good choice if you want to access different versions of a Hadoop filesystem from the same client code (you will need to run a proxy for each version of Hadoop to achieve this, however).
For installation and usage instructions, please refer to the
documentation in the src/contrib/thriftfs directory of
the Hadoop distribution.
Hadoop provides a C library called libhdfs that mirrors the Java
FileSystem interface (it was
written as a C library for accessing HDFS, but despite its name it
can be used to access any Hadoop filesystem). It works using the
Java Native Interface (JNI) to call a Java
filesystem client.
The C API is very similar to the Java one, but it typically
lags the Java one, so newer features may not be supported. You can
find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop
distribution.
Hadoop comes with prebuilt libhdfs binaries for 32-bit Linux, but
for other platforms, you will need to build them yourself using the
instructions at http://wiki.apache.org/hadoop/LibHDFS.
Filesystem in Userspace (FUSE) allows
filesystems that are implemented in user space to be integrated as a
Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop
filesystem (but typically HDFS) to be mounted as a standard
filesystem. You can then use Unix utilities (such as
ls and cat) to interact with
the filesystem, as well as POSIX libraries to access the filesystem
from any programming language.
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS.
Documentation for compiling and running Fuse-DFS is located in the
src/contrib/fuse-dfs directory
of the Hadoop distribution.
WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as filesystems on most operating systems, so by exposing HDFS (or other Hadoop filesystems) over WebDAV, it’s possible to access HDFS as a standard filesystem.
At the time of this writing, WebDAV support in Hadoop (which is implemented by calling the Java API to Hadoop) is still under development, and can be tracked at https://issues.apach...owse/HADOOP-496.
There are two interfaces that are specific to HDFS:
- HTTP
HDFS defines a read-only interface for retrieving directory listings and data over HTTP. Directory listings are served by the namenode’s embedded web server (which runs on port 50070) in XML format, while file data is streamed from datanodes by their web servers (running on port 50075). This protocol is not tied to a specific HDFS version, making it possible to write clients that can use HTTP to read data from HDFS clusters that run different versions of Hadoop.
HftpFileSystemis a such a client: it is a Hadoop filesystem that talks to HDFS over HTTP (HsftpFileSystemis the HTTPS variant).- FTP
Although not complete at the time of this writing (https://issues.apach...wse/HADOOP-3199), there is an FTP interface to HDFS, which permits the use of the FTP protocol to interact with HDFS. This interface is a convenient way to transfer data into and out of HDFS using existing FTP clients.
The FTP interface to HDFS is not to be confused with
FTPFileSystem, which exposes any FTP server as a Hadoop filesystem.
[25] The RPC interfaces in Hadoop are based on Hadoop’s Writable interface, which is
Java-centric. In the future, Hadoop will adopt another,
cross-language, RPC serialization format, which will allow native
HDFS clients to be written in languages other than Java.
Apache Hadoop is ideal for organizations with a growing need to process massive application datasets. Hadoop: The Definitive Guide is a comprehensive resource for using Hadoop to build reliable, scalable, distributed systems. Programmers will find details for analyzing large datasets with Hadoop, and administrators will learn how to set up and run Hadoop clusters. The book includes case studies that illustrate how Hadoop is used to solve specific problems.




Help







