Thursday, September 24, 2015

HDFS File Read: How It Works at a High Level

After reading this post, you will understand how the file read operation works in HDFS at a high level. It assumes you are reading the posts in this blog in order.

I would recommend reading my earlier post "HDFS Part I" before continuing, so that you understand the architecture of HDFS first.

Let's consider the data we stored in the earlier post and walk through the READ operation on it.

[Diagram: HDFS file read flow. Source: Google]


As soon as the client receives a read request from the user, it contacts the NameNode for the list of blocks that must be read to fulfil the request.

Block locations are stored in the NameNode's metadata.

Since the replication factor is 3 in our example, each block is held by three DataNodes. The NameNode keeps the DataNode addresses for every block and, when the client asks for the blocks to read, returns those addresses sorted by network proximity, so the client can read each block from the nearest DataNode.
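
To make this concrete, here is a minimal sketch in Java using the Hadoop FileSystem API that asks the NameNode for the block locations of a file. The NameNode URI (hdfs://namenode:8020) and the file path are assumptions for illustration, not values from this post.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode address and file path below are assumed values
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/user/demo/sample.txt");

        FileStatus status = fs.getFileStatus(file);
        // Answered from the NameNode's metadata; no DataNode is contacted here
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            // With a replication factor of 3, each block lists three hosts
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}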

Once the list of DataNodes is identified, the client starts reading the data blocks directly from the DataNodes. When it has read all the blocks, it closes the connection and returns the result to the user.

The NameNode is not involved in the actual data transfer; it only supplies the addresses of the DataNodes that hold the blocks.
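
Putting the whole flow together, a client read boils down to a few calls on the FileSystem API. Below is a minimal sketch under the same assumptions as before (the NameNode URI and file path are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FSDataInputStream in = null;
        try {
            // open() fetches the block locations from the NameNode (metadata only)
            in = fs.open(new Path("/user/demo/sample.txt"));
            // read()s stream the bytes directly from the DataNodes
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in); // close once all blocks are read
            fs.close();
        }
    }
}

Note how the client never addresses a DataNode explicitly; the stream locates and contacts the right DataNodes behind the scenes.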

I hope you now understand the basics of how a file read works. Please post your comments/suggestions if any.



Disclaimer: All content provided on this blog is for informational purposes only. The content shared here is purely what I have learnt and my practical knowledge. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided on this website.


2 comments:

  1. Can you explain what the read, close, and open arrows in the HDFS box represent?

  2. This is how it works:
    1) The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
    2) DistributedFileSystem then calls the namenode, using remote procedure calls, to find the locations of the blocks in the file.
    3) For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
    4) If the client is itself a datanode, it will read from the local datanode, provided that datanode hosts a copy of the block.
    5) DistributedFileSystem returns an FSDataInputStream (which supports file seek) to the client for reading the data.
    6) The client then calls read() on the stream.
    7) DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first (closest, per the network topology) datanode for the first block in the file.
    8) Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
    9) When the end of the block is reached, DFSInputStream closes the connection to that datanode and finds the best datanode (based on the topology) for the next block.
    10) When the client has finished reading all the blocks, it calls close() on the FSDataInputStream.

    Note: Blocks are read in order. (A code sketch of these calls is shown after this comment.)

    Hope it helps.
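
Following up on the steps above: since FSDataInputStream supports seek (step 5), here is a small sketch that reads a file and then seeks back to re-read it from the start. As before, the NameNode URI and file path are assumed values for illustration.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // read the whole file
            in.seek(0);                                     // jump back to the start
            IOUtils.copyBytes(in, System.out, 4096, false); // read it again
        } finally {
            fs.close();
        }
    }
}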
