Abstract
Many high-end computing (HEC) centers and commercial data centers adopt parallel file systems (PFSs) as their storage solutions. As the concurrent applications running on PFSs grow in both quantity and variety, scheduling algorithms for data access are expected to play an increasingly important role in PFS service quality. However, it is costly and disruptive to thoroughly study different scheduling mechanisms on peta- or exascale systems, and the complexity of scheduling policy implementation and experimental data gathering makes such tests even harder. While a few parallel file system simulation frameworks have been proposed (e.g., [1,2]), their goals have not been scheduling algorithm evaluation. In this paper, we propose PFSsim, a simulator designed for evaluating I/O scheduling algorithms in PFSs. PFSsim is a trace-driven simulator based on the network simulation framework OMNeT++ and the disk device simulator DiskSim. A proxy-based scheduler module is provided for scheduling algorithm deployment, and the system parameters are highly configurable. We have simulated PVFS2 on PFSsim, and the experimental results show that PFSsim is capable of reproducing the system characteristics and capturing the effects of the scheduling algorithms.
Introduction
In recent years, parallel file systems (PFSs) such as Lustre[3], PVFS2[4], Ceph[5], and PanFS[6] have become increasingly popular in high-end computing (HEC) centers and commercial data centers; for instance, as of April 2009, half of the world's top 30 supercomputers used Lustre[7] as their storage solution. PFSs outperform traditional distributed file systems such as NFS[8] in many application domains. An important reason is that they adopt the object-based storage model[9] and stripe large data accesses into smaller storage objects distributed across the storage system for high-throughput parallel access and load balancing.
In HEC and data center systems, there are often large numbers of applications that access data from a distributed storage system with a variety of Quality-of-Service requirements[10]. As such systems are predicted to continue to grow in the amount of resources and concurrent applications, I/O scheduling strategies that allow performance isolation are expected to become increasingly important. Unfortunately, most PFSs are not able to manage I/O flows on a per-data-flow basis: the scheduling modules that come with the PFSs are typically configured to fulfill an overall performance goal, rather than the quality of service of each application.
There is a considerable amount of existing work[11,12,13,14,15,16] on achieving differentiated service at a centralized management point. Nevertheless, the applicability of these algorithms in the context of parallel file systems has not been thoroughly studied. Challenges in HEC environments include the facts that applications issue data flows from a potentially large number of clients and that parallel checkpointing of applications becomes increasingly important to achieve the desired levels of reliability; in such environments, centralized scheduling algorithms can be limiting from the scalability and availability standpoints. To the best of our knowledge, there are few existing decentralized I/O scheduling algorithms for distributed storage systems, and the proposed decentralized algorithms (e.g., [17]) still need verification in terms of their suitability for PFSs, which are a subset of distributed storage systems.
While PFSs are widely adopted in the HEC field, research on the corresponding scheduling algorithms is not easy. Two key factors prevent testing on real systems: 1) scheduler testing on a peta- or exascale file system requires costly and complex deployment and experimental data gathering; 2) experiments with the storage resources used in HEC systems can be very disruptive, as deployed production systems are typically expected to have high utilization. In this context, a simulator that allows developers to test and evaluate different scheduler designs for HEC systems is very valuable: it frees developers from complicated deployment in real systems and cuts the cost of algorithm development. Even though simulation results are bound to have discrepancies compared to real performance, they can offer very useful insights into performance trends and allow pruning of the design space before implementation and evaluation on a real testbed or a deployed system.
In this paper, we propose a parallel file system simulator, PFSsim. Our design objectives for this simulator are: 1) Ease of use: scheduling algorithms, PFS characteristics and network topologies can be easily configured at compile-time; 2) Fidelity: it should accurately model the effect of HEC workloads and scheduling algorithms; 3) Ubiquity: it should be flexible enough to simulate a large variety of storage and network characteristics; 4) Scalability: it should be able to simulate up to thousands of machines in a medium-scale scheduling algorithm study.
The rest of the paper is organized as follows. In section 2, we introduce the related work on PFS simulation. In section 3, we describe the PFS and scheduler abstractions. In section 4, we present the implementation details of PFSsim. In section 5, we show the validation results. In the last section, we conclude our work and discuss future work.
Related Work
To the best of our knowledge, two parallel file system simulators have been presented in the literature: the IMPIOUS simulator proposed by E. Molina-Estolano et al.[1], and the simulator developed by P. Carns et al.[2].
The IMPIOUS simulator is developed for fast evaluation of PFS designs. It simulates the parallel file system abstraction with user-provided file system specifications, which include data placement strategies, replication strategies, locking disciplines and caching strategies. In this simulator, the client modules read the I/O traces and the PFS specifications, and then issue the requests directly to the Object Storage Device (OSD) modules according to the configuration. The OSD modules can be simulated with the DiskSim simulator[18] or a "simple disk model"; the former provides higher accuracy and the latter higher efficiency. For the goal of fast and efficient simulation, IMPIOUS simplifies the PFS model by omitting the metadata server modules and the corresponding communications, and since its focus is not on scheduling strategies, it does not support explicit deployment of scheduling policies.
The other PFS simulator is described by P. H. Carns et al. in their work on PVFS2 server-to-server communication mechanisms[2]. This simulator is used for testing the overhead of metadata communications, specifically in PVFS2; thus, a detailed TCP/IP-based network model is implemented. The authors employ the INET extension[19] of the OMNeT++ discrete event simulation framework[20] to simulate the network. The simulator also uses a "bottom-up" approach to simulate the underlying systems (PVFS2 and Linux), which achieves high fidelity but compromises flexibility.
We take inspiration from these related systems and develop an expandable, modularized design where the emphasis is on the scheduler. Based on this goal, we use DiskSim to simulate physical disks in detail, and we use the extensible OMNeT++ framework for the network simulation and the handling of simulated events. While we currently use a simple networking model, as pointed out above, OMNeT++ supports the INET extension, which can be incorporated to enable precise network simulation in our simulator at the expense of longer simulation times.
PFS and Scheduler Abstractions
Abstraction of Parallel File Systems
In this subsection, we first describe the similarities among Parallel File Systems (PFSs), and then discuss the differences that exist among them.
Considering all the commonly used PFSs, we find that the majority of them share the same basic architecture:
1. There are one or more data servers, which are built on top of local file systems (e.g., Lustre, PVFS2) or block devices (e.g., Ceph). The application data are stored in the form of fixed-size PFS objects, whose IDs are unique in a global name space. A file locking feature is provided in some PFSs (e.g., Lustre, Ceph);
2. There are one or more metadata servers, which typically manage the mappings from the PFS file name space to the PFS storage object name space, the PFS object placement, as well as the metadata operations;
3. The PFS clients run on the system users' machines; they provide the interface (e.g., POSIX) for users and user applications to access the PFS.
For a general PFS, a file access request (read/write operation) goes through the following steps:
1. Receiving the file I/O request: By calling an API, the system user sends its request {operation, offset, file_path, size} to the PFS client running on the user’s machine.
2. Object mapping: The client maps the tuple {offset, file_path, size} to a series of objects which contain the file data. This information is either available locally or requires the client to query the metadata server.
3. Locating the objects: The client determines which data servers store the objects. Typically each data server stores a static set of object IDs, and this mapping information is often available on the client.
4. Data transmission: The client sends data I/O requests with the information {operation, object_ID} to the designated data servers. The data servers reply to the requests, and the data I/O starts.
Note that we have omitted the access permission checks (often conducted on the metadata server) and the locking schemes (conducted on either the metadata server or the data servers).
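As an illustration of step 2, the following sketch shows how a file-level request could be split into the fixed-size objects it touches under a simple striping layout. The function name, the structure, the toy object-ID scheme and the object size are all hypothetical, not the scheme of any particular PFS.
#include <algorithm>
#include <cstdint>
#include <vector>
// Hypothetical illustration only: split a file-level request {offset, size}
// into the fixed-size PFS objects it touches (step 2 above).
struct ObjectRange {
    uint64_t objectID;   // globally unique object ID (toy scheme below)
    uint64_t offset;     // offset inside the object
    uint64_t length;     // bytes accessed in this object
};
std::vector<ObjectRange> mapRequest(uint64_t fileID, uint64_t offset,
                                    uint64_t size, uint64_t objectSize) {
    std::vector<ObjectRange> ranges;
    uint64_t end = offset + size;
    while (offset < end) {
        uint64_t index = offset / objectSize;             // which object of the file
        uint64_t inObj = offset % objectSize;             // offset inside that object
        uint64_t len   = std::min(objectSize - inObj, end - offset);
        ranges.push_back({fileID * 1000000 + index,       // toy global object ID
                          inObj, len});
        offset += len;
    }
    return ranges;
}
Step 3 then maps each object ID to a data server, which is the data placement decision discussed below.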
Although different PFSs follow the same architecture, they differ from each other in many ways, such as data distribution methodology, metadata storage pattern, user API, etc. Among these differences, four aspects have, in our view, significant effects on the I/O performance: metadata management, data placement strategy, data replication model and data caching policy. Thus, to construct a scheduling policy test simulator for various PFSs, these aspects should be faithfully simulated.
It has been shown that, at least in some cases, metadata operations take a large proportion of file system workloads[21]; and because it lies on the critical path, metadata management can be very important to the overall I/O performance. Different PFSs use different techniques to manage metadata to achieve different levels of metadata access speed, consistency and reliability. For example, Ceph adopts the dynamic subtree partitioning technique[22] to distribute the metadata onto multiple metadata servers for high access locality and cache efficiency. Lustre deploys two metadata servers, one "active" server and one "standby" server for failover. In PVFS2, metadata are distributed onto the data servers to prevent a single point of failure and a performance bottleneck. By tuning the metadata server module and the network topology in our simulator, users are able to set up the metadata storage and access patterns. We also enable metadata caching capabilities on both clients and metadata servers.
Data placement strategies are designed with the basic goal of achieving high I/O parallelism and server utilization/load balancing, but different PFSs still vary significantly from each other because of their different usage contexts. Ceph aims at large-scale storage systems that potentially have high metadata communication overhead; thus, Ceph uses a local hashing function and the CRUSH (Controlled Replication Under Scalable Hashing) technique[23] to map object IDs to the corresponding OSDs in a distributed manner, which avoids metadata communication during data location lookup and reduces the update frequency of the system map. In contrast, aiming to serve users with higher trust and skills, PVFS2 provides flexible data placement options and even delegates to users the ability to store data on user-specified data servers. In our simulator, users implement the data placement strategies. In the trace files, the I/O positions are given as tuples of {file_ID, offset, size}, while the clients are only able to send the I/O requests once they are expressed as {server_ID, object_ID}; the mapping between the two must be defined by the simulator users, which gives them the flexibility to implement any type of data placement strategy.
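To make this placement hook concrete, the sketch below shows two simple strategies a user might implement for the object-to-server mapping: a round-robin (striping) layout and a hash-based layout in the spirit of CRUSH. The function names and signatures are illustrative assumptions, not PFSsim's actual interface.
#include <cstdint>
#include <functional>
// Hypothetical user-defined placement strategies (names/signatures are assumed).
// Striping-style placement: consecutive objects of a file rotate over the servers.
int placeRoundRobin(uint64_t objectID, int numServers) {
    return static_cast<int>(objectID % static_cast<uint64_t>(numServers));
}
// Hash-based placement in the spirit of (but far simpler than) CRUSH: the server
// is derived from a hash of the object ID, so locating an object requires no
// table lookup and no metadata traffic.
int placeHashed(uint64_t objectID, int numServers) {
    return static_cast<int>(std::hash<uint64_t>{}(objectID)
                            % static_cast<uint64_t>(numServers));
}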
Data replication and failover models also affect the I/O performance, because in systems with data replication enabled, data are written to multiple locations, which may prolong the write process. For example, with data replication enabled in Ceph, every write operation is committed to both the primary OSD and the replica OSDs inside a placement group. Though Ceph maintains parallelism when forwarding data to the replica OSDs, the cost of data forwarding and synchronization is still non-negligible. Lustre and PVFS2 do not implement explicit data replication models, assuming that replication is done by the underlying hardware. In our simulator, client-managed data replication schemes can be implemented by spawning the same data I/O to multiple data servers. By enabling inter-server communication (which is not the default setup), users can also implement metadata-server-managed or data-server-managed replication schemes.
Data caching on the server side and client side may improve PFS I/O performance, but client-side cache coherency also needs to be managed. PanFS data servers implement write-data caching that aggregates multiple writes for efficient transmission and data layout at the OSDs, which may increase the disk I/O rate. Ceph implements the O_LAZY flag for the open operation at the client side, which allows applications to relax the usual coherency requirements for a shared-write file; this facilitates HPC applications that often have concurrent accesses to different parts of a file. Some PFSs, for example PVFS2, do not implement client caching in their default setup. We have not implemented data caching components in PFSsim and leave them for the near future.
Abstraction of PFS Scheduler
Among the many centralized and decentralized scheduling algorithms proposed for distributed storage systems, a wide variety of network fabrics and deployment locations have been chosen. For instance, in [17], the schedulers are deployed on the Coordinators, which reside between the system clients and the storage Bricks. In [16], the scheduler is implemented on a centralized broker, which captures all the system I/O and dispatches it to the disks. In [24], the scheduling policies are deployed on the network gateways, which serve as the storage system portals for the clients. And in [25], the scheduling policies are deployed on per-server proxies, which intercept I/O and virtualize the data servers to the system clients.
In our simulator, the system network is simulated with high flexibility: users are able to deploy their own network fabric with basic or user-defined devices. Schedulers can be created and positioned at any point of the network. For more advanced designs, inter-scheduler communication can also be enabled. The scheduling algorithms are defined by the PFSsim users, and abstract APIs are exposed to allow the schedulers to keep track of the data server status.
Simulator Implementation
Parallel File System Scheduling Simulator
Based on the abstractions described in section 3, we have developed the parallel file system simulator PFSsim on top of the discrete event simulation framework OMNeT++ 4.0 and the disk system simulator DiskSim 4.0.
Figure 1. The architecture of a simulated PFS with per-data-server schedulers. The two dashed frames indicate the entities simulated by OMNeT++ and DiskSim, respectively.
In PFSsim, the client modules, metadata server modules, scheduler modules, data server daemon modules and the local file system modules are all simulated by OMNeT++ 4.0, which also simulates the network for communications among the modules. DiskSim 4.0 is employed for the detailed simulation of disk models; one DiskSim process is forked for each simulated disk system. Figure 1 illustrates the architecture of a simulated PFS with per-data-server schedulers. Note that the "local file system" modules are drawn in dashed boxes because they can be removed, in which case the data server daemons interface directly with the disk systems.
The simulation input is provided in the form of trace files which contain the I/O requests from the users. In a typical setup, upon reading one I/O request from the input file, the client checks the local cache for the object and location information. If it is not cached, the client sends a QUERY request to the metadata server. On the metadata server, the corresponding metadata processing is done (extra traffic may be incurred if the metadata is stored in a distributed manner); the QUERY request is then sent back to the client and destroyed. Next, the client sends the JOB requests, which contain the object ID and other access information, to the corresponding schedulers. At the schedulers, the JOB requests are reordered according to the scheduling policies and sent to the data server daemons. Upon receiving a JOB request, a data server daemon may perform locking/optimization operations on the requested blocks and forwards the request to the local file system. The local file system maps the object ID to physical block numbers (note that for simplicity we avoid the mapping to local files). The local file system also does the buffering/caching: for a read operation, it checks whether the blocks are in the cache, and if not, block access requests are issued; for a write operation, it checks whether the data needs to be written to the disk (e.g., a write-through operation or a full buffer), and if so, block access requests are issued. Finally, the block access requests are sent to DiskSim through inter-process communication over a network connection (currently, TCP).
When a block access request is completed in DiskSim, the finish time is sent back to the local file system module in OMNeT++. When the operations on all requested blocks are done, the data server daemon writes the timestamps into the JOB request and sends it back to the client. Finally, the client writes the job timestamp information into the output file, and the JOB request is destroyed.
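The client side of the request path just described can be condensed into the following sketch. The type and function names (TraceRecord, Layout, issueQuery, issueJob) are illustrative assumptions, not PFSsim's actual identifiers; the two private functions stand in for the simulated QUERY and JOB message exchanges.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>
// Hypothetical condensed view of the client-side request path (names assumed).
struct TraceRecord { int op; uint64_t fileID, offset, size; };   // one trace entry
struct Layout      { std::vector<uint64_t> objectIDs;            // objects of the file
                     std::vector<int>      serverIDs; };         // server holding each object
class Client {
    std::unordered_map<uint64_t, Layout> metadataCache;          // fileID -> layout
public:
    void handleTrace(const TraceRecord &t) {
        auto it = metadataCache.find(t.fileID);
        if (it == metadataCache.end()) {
            // Not cached: QUERY the metadata server and cache the returned layout.
            it = metadataCache.emplace(t.fileID, issueQuery(t.fileID)).first;
        }
        const Layout &l = it->second;
        for (std::size_t i = 0; i < l.objectIDs.size(); ++i)
            issueJob(l.serverIDs[i], l.objectIDs[i], t.op);      // JOB to the scheduler
    }
private:
    // Stand-ins for the simulated metadata round trip and JOB dispatch.
    Layout issueQuery(uint64_t /*fileID*/) { return Layout{}; }
    void   issueJob(int /*serverID*/, uint64_t /*objectID*/, int /*op*/) {}
};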
Scheduler Implementation
The scheduler module is designed for easy implementation of scheduling algorithms; in addition, we enable inter-scheduler communication so that users can implement cooperative scheduling schemes.
We provide a base class for all algorithm implementations, so that scheduling algorithms can be realized by inheriting this class. The base class contains the following essential functions:
void jobArrival(JOB * job); // callback
void jobFinish(int ID); // callback
void getSchInfo(Message * msg); // callback
void sendSchInfo(int ID, Message * msg);
bool dispatchJob(int ID, JOB * job);
The JOB objects are the JOB requests referred to in the above subsection. The Message objects are packets defined by the users for exchanging scheduling information. jobArrival is called when a new JOB request arrives at the scheduler. jobFinish is called when a JOB request has just finished being served. getSchInfo is called when the scheduler receives a scheduler-to-scheduler message. These three functions are callback functions, which means they are called by the simulator code rather than by the user code. sendSchInfo is called to send a message to other schedulers, and dispatchJob is called to send an I/O request to a data server.
Simulator users can override these functions to specify the desired behaviors; additional functions and data structures can be implemented to construct the scheduling schemes.
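As a usage illustration, a minimal FIFO scheduler with a bounded dispatch depth could look like the sketch below. For the sake of the example we assume that the base class is named Scheduler, that its functions can be overridden, and that each JOB carries a serverID field; these names are assumptions, and only the five functions listed above come from the actual interface.
#include <queue>
// Minimal sketch of a FIFO scheduler built on the base class described above.
// "Scheduler" and "JOB::serverID" are assumed names for illustration.
class FIFOScheduler : public Scheduler {
    std::queue<JOB *> pending;        // JOBs waiting, in arrival order
    int outstanding = 0;              // JOBs dispatched but not yet finished
    static const int DEPTH = 4;       // max concurrent JOBs at the data server
    void tryDispatch() {
        while (outstanding < DEPTH && !pending.empty()) {
            JOB *job = pending.front();
            pending.pop();
            if (dispatchJob(job->serverID, job))    // forward to the data server
                ++outstanding;
        }
    }
public:
    void jobArrival(JOB *job) {       // callback: a new JOB arrived at the scheduler
        pending.push(job);
        tryDispatch();
    }
    void jobFinish(int id) {          // callback: a dispatched JOB completed
        --outstanding;                // the JOB id is not needed for plain FIFO
        tryDispatch();
    }
    void getSchInfo(Message *msg) {   // callback: inter-scheduler message
        // A plain FIFO policy needs no coordination; ignore the message.
    }
};
More elaborate policies (e.g., the SFQ(D) algorithm used in section 5.2) follow the same pattern, ordering the pending queue by algorithm-specific tags instead of arrival order.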
TCP Connection between OMNeT++ and DiskSim
One challenge of building a system that integrates multiple simulators is virtual time synchronization. Each simulator instance runs its own virtual time, and each one generates a large number of events every second. The synchronization, if performed inefficiently, can become a bottleneck for simulation speed.
Since DiskSim provides the functionality of reporting the timestamp of its next event, OMNeT++ can always proactively synchronize with every DiskSim instance at the provided timestamp.
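A sketch of this proactive synchronization is shown below, assuming a hypothetical DiskSimProxy wrapper around each DiskSim connection; the names are illustrative, not PFSsim's actual code. OMNeT++ would schedule its next synchronization event at the earliest timestamp reported by any DiskSim instance.
#include <algorithm>
#include <limits>
#include <vector>
// Hypothetical wrapper around one DiskSim instance's connection.
struct DiskSimProxy {
    // Asks DiskSim for the timestamp of its next internal event (over TCP).
    double peekNextEventTime() const;
};
// Virtual time up to which OMNeT++ can safely advance before it must
// synchronize again with the DiskSim instances.
double nextSyncPoint(const std::vector<DiskSimProxy> &disks) {
    double earliest = std::numeric_limits<double>::infinity();
    for (const auto &d : disks)
        earliest = std::min(earliest, d.peekNextEventTime());
    return earliest;
}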
Currently, the OMNeT++ simulator and the DiskSim instances communicate through TCP connections. Even though we have optimized the synchronization efficiency, we find that the TCP connection cost is still the bottleneck of the simulation speed. In future work, we plan to introduce more efficient synchronization mechanisms, such as shared memory.
Local File System Simulation
In PFSsim, the local file system module has two parts: data caching/buffering and address mapping.
We have implemented data caching for read operations and data buffering for write operations. Users have the flexibility to define the cache size, buffer size and timeouts. The local file system module simulates the memory structure in effect: the "cache" structure records the addresses of the blocks that are in the simulated cache, and blocks are swapped out after a timeout; the "buffer" structure records the addresses of the updated blocks in the simulated buffer, and blocks are written back when they reach the timeout or the buffer is full.
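As a concrete example of the "buffer" structure's behavior, the sketch below tracks dirty block addresses and signals a write-back when the buffer is full or an entry exceeds its timeout. The class name, the per-block timestamps and the flush granularity are assumptions made for illustration, not PFSsim's exact implementation.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>
// Hypothetical sketch of the simulated write buffer.
class WriteBuffer {
    std::unordered_map<uint64_t, double> dirty;   // block address -> time buffered
    std::size_t capacityBlocks;                   // buffer size in blocks
    double timeout;                               // write-back timeout (seconds)
public:
    WriteBuffer(std::size_t capacity, double t) : capacityBlocks(capacity), timeout(t) {}
    // Record a buffered write; returns true if a write-back must be triggered
    // (buffer full, or some dirty block has been buffered longer than the timeout).
    bool bufferWrite(uint64_t block, double now) {
        dirty[block] = now;
        if (dirty.size() >= capacityBlocks) return true;
        for (const auto &kv : dirty)
            if (now - kv.second >= timeout) return true;
        return false;
    }
    // Hand all dirty blocks to the disk model (DiskSim) and empty the buffer.
    std::vector<uint64_t> flush() {
        std::vector<uint64_t> blocks;
        blocks.reserve(dirty.size());
        for (const auto &kv : dirty) blocks.push_back(kv.first);
        dirty.clear();
        return blocks;
    }
};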
We do not explore the address mapping technologies of local file systems in depth, because 1) local file system block allocation heavily depends on the context of storage usage (e.g., EXT4[26]), which is very flexible; and 2) we consider the disk seek time between different addresses a negligible factor compared to major factors such as the total disk data transfer time and the network delay.
Validation and Evaluation
In this section, we validate the PFSsim simulation results against real system performance. PVFS2 is used as the benchmark parallel file system. The system consists of 4 data servers, 1 metadata server and a variable number of clients. On each data server node, we also deploy a proxy-based scheduler that intercepts all the I/O requests going through the local machine. All the nodes are built as Xen virtual machines hosted on a cluster of eight DELL PowerEdge 2970 servers. Each physical node has two six-core 2.4GHz Opteron CPUs, 32GB of RAM and a 7.2K RPM SAS disk. PVFS2 is deployed on these VMs in such a way that each physical node hosts at most eight client VMs and one server VM. Each virtual machine is configured with a 2.4GHz AMD CPU and 1GB of memory. The hypervisor and virtual machines run a paravirtualized 2.6.18.8 kernel with Ubuntu 8.04. EXT3 is used as the local file system for PVFS2.
The simulator is configured with the characteristics of the real system. Consistent with the PVFS2 setup, PFS locking and caching are not enabled in the simulator. We set an 800MB cache capacity for the local file system modules, with a Least Recently Used (LRU) policy; the block size is 4KB. The disks are characterized in DiskSim 4.0 with parameters extracted from the real system. The network is set up as 1Gbps Ethernet with 0.2 milliseconds of communication delay.
We use the Interleaved or Random (IOR) benchmark[27] to generate the benchmark I/O for the system. This benchmark allows us to specify the I/O traffic patterns.
PFS Simulator Validation
To validate the simulator fidelity under different system workloads, we performed five independent experiments with 4, 8, 16, 32 and 64 clients, for both read and write. Every client issues sequential write/read I/Os to 400 files, each containing 1MB of data. The data of each file is evenly distributed across the four data servers with a striping size of 256KB. In order to show the performance of I/O buffering/caching, the file reading tests are done on the same files right after the corresponding file writing tests, which means the data to read may still be in memory due to buffered write content. The proxies have FIFO scheduling policies deployed.
Figure 2. Average system throughput with different numbers of clients
Figure 3. Average request response time with different numbers of clients
Figure 2 depicts the average system throughput for the different client number setups. We can see that the simulated throughput matches the real system throughput very well. With 4 and 8 clients, the system provides high read throughput; this is because the data are still in the write buffer from the previous write test. The simulator behaves in a similar manner, so these reads only trigger memory I/O. We also observe that the throughput with 8 clients is twice that with 4 clients, due to the parallelism on the PVFS2 servers; we enabled the same parallelism in the simulated system. For the tests with 16, 32 and 64 clients, the read throughput decreases dramatically. The reason is that these reads incur disk I/O, which has a much higher penalty than memory I/O: the virtual machines have a limited amount of memory, so the servers can never cache more than 1GB of data, and thus at least part of the data previously written to the servers has been swapped out to the disks. For the write tests, we see that with 4 clients the writes are held in the write buffer, so the system achieves higher throughput. As the system workload increases, the throughput decreases, because when the number of dirty pages exceeds a threshold the system starts to flush the pages to disk, which incurs a high penalty; in the simulator, we set this threshold to 400MB. For both the real system and the simulator, as the number of clients grows the throughput becomes more stable, because the disks are saturated and are the major system bottleneck.
To examine the simulation of I/O delay, we also measured the average response time for each 1MB I/O request. From Figure 3 we can see that the simulated response time matches the real system average response time well. The average response time grows non-linearly because of the different delays of memory I/O and disk I/O on the data servers.
This set of tests shows that, given the appropriate parameters, the simulator is able to simulate a typical PFS system with good accuracy.
Scheduler Validation
In this subsection, we validate the capability of PFSsim for testing I/O scheduling algorithms. The testbed setup is similar to that of section 5.1. We deploy 32 PVFS2 clients in the system, and for the purpose of algorithm testing, the clients are separated into two groups, Group1 and Group2, each with 16 clients. The Start-time Fair Queuing algorithm with depth D = 4 (SFQ(D))[12] is deployed on each data server, which imposes weight-based proportional-sharing policies on the I/O.
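For reference, the core of SFQ(D) is the tag computation sketched below: each request receives a start tag and a finish tag based on its flow's weight, requests are dispatched in start-tag order, and at most D of them are outstanding at the data server. This is a generic sketch of the algorithm in the spirit of [12], not PFSsim's implementation; the names and the cost metric are assumptions.
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>
// Generic sketch of SFQ(D) tag computation and dispatch (illustrative names).
struct Request { int flow; double cost; double startTag; double finishTag; };
class SFQD {
    std::vector<double> weight;       // per-flow share (e.g., {1.0, 2.0} for a 1:2 ratio)
    std::vector<double> lastFinish;   // finish tag of each flow's previous request
    double vtime = 0;                 // virtual time = start tag of last dispatch
    int depth;                        // D: max outstanding requests
    int outstanding = 0;
    struct ByStartTag {
        bool operator()(const Request *a, const Request *b) const {
            return a->startTag > b->startTag;    // min-heap on start tag
        }
    };
    std::priority_queue<Request *, std::vector<Request *>, ByStartTag> pending;
public:
    SFQD(std::vector<double> w, int d)
        : weight(std::move(w)), lastFinish(weight.size(), 0.0), depth(d) {}
    void arrive(Request *r) {
        r->startTag  = std::max(vtime, lastFinish[r->flow]);
        r->finishTag = r->startTag + r->cost / weight[r->flow];
        lastFinish[r->flow] = r->finishTag;
        pending.push(r);
    }
    // Next request to send to the data server, or nullptr if none can be sent.
    Request *dispatch() {
        if (outstanding >= depth || pending.empty()) return nullptr;
        Request *r = pending.top();
        pending.pop();
        vtime = r->startTag;          // advance virtual time
        ++outstanding;
        return r;
    }
    void finish() { --outstanding; }  // called when the data server completes a request
};
With weights {1.0, 4.0}, for example, the second flow's tags advance four times more slowly per unit of cost, which is how the 1:4 proportional share in set c below is enforced.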
We conducted three sets of tests, and in each set we assign a different weight ratio to the two groups to enforce proportional sharing. Every set is run on both the real system and the simulator. Sets a, b and c have weight ratios (Group1:Group2) of 1:1, 1:2 and 1:4, respectively. We measured the average and the variation of Group2's throughput ratio during the first 200 seconds of system runtime.
Figure 4 shows Group2's throughput ratio variation during a period of system runtime. First of all, we can see that for the three setups, the simulated average throughput ratio of Group2 is very close (<5% error) to the results from the real system. Second, in both the real system and the simulation results, we see the trend that the oscillations in the throughput samples grow as the system uses a more imbalanced share ratio. The simulator is able to reflect the characteristics of the I/O scheduling algorithms because it is tuned with most of the major factors that are relevant to the system I/O performance.
From the results we can also observe that the real system samples have more oscillations than the simulation results. This difference is due to the complexity of the real system, where many dynamic factors can contribute to variations in the results. Since PFSsim uses abstracted models to simulate the performance of the real systems, it may not achieve ideal accuracy. For example, in PFSsim we do not simulate every single TCP packet on the wire; instead, we simplify every 256KB request into one single packet. By doing this, we are not able to track factors such as the TCP window size and TCP timeouts. However, since PFSsim supports the extension of a detailed TCP network model and other detailed models, users have the option to run more accurate simulations; it is a tradeoff between simulation accuracy and simulation time. It is also our future work to design more detailed or statistical modules to simulate these dynamic variations with higher accuracy.
(a) 1:1, 50.17%, 50.06%
(b) 1:2, 67.50%, 65.26%
(c) 1:4, 73.67%, 76.63%
Figure 4. The throughput ratio that Group2 takes in the first 200 seconds of runtime in both the real system and the simulated system. In each row, the left chart is the real system result and the right chart is the simulation result. The three parts in each label are: the weight ratio of Group1 to Group2, Group2's real throughput ratio, and Group2's simulated throughput ratio.
Conclusion and Future Work
The design objective of PFSsim is to provide users with an easy-to-use simulated PFS testbed for I/O scheduling algorithm design. We provide a flexible scheduler module for scheduling scheme deployment. The network topology, disk model, PFS specification and workload can be easily tuned through script files. Since PFSsim abstracts the major factors that contribute to the system I/O performance, we expect that, given appropriate parameters, good simulation accuracy can be achieved. PFSsim is also highly extensible; users can extend any module of the simulator for higher accuracy or customized designs. The validation of the PFS simulator and the scheduler module shows that the system is capable of simulating the performance of a typical PFS, given the profiling parameters, the scheduling algorithm and the workloads. In terms of scalability, as far as we have tested, the system scales to simulations of up to 512 clients and 32 data servers. The simulation time efficiency is also acceptable: the simulation containing 64 clients in section 5.1 takes less than 1 minute to finish.
In the future, we are going to implement client-side caching and locking mechanisms to make PFSsim suitable for more PFS simulations. We will also develop more accurate network models which characterize the statistical behavior of real TCP connections. Moreover, we plan to simulate the disk systems with more abstract models that still provide acceptable accuracy; by avoiding the time spent in DiskSim and in the communication between OMNeT++ and DiskSim, PFSsim will gain a boost in simulation efficiency.
References
[1] E. Molina-Estolano, C. Maltzahn, J. Bent and S. Brandt, "Building a parallel file system simulator", Journal of Physics: Conference Series 180 012050, 2009.
[2] P. Carns, B. Settlemyer and W. Ligon, "Using Server-to-Server Communication in Parallel File Systems to Simplify Consistency and Improve Performance", Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC'08), Austin, TX, November 2008.
[3] Sun Microsystems, Inc., "Lustre File System: High-Performance Storage Architecture and Scalable Cluster File System", White Paper, October 2008.
[4] P. Carns, W. Ligon, R. Ross and R. Thakur, "PVFS: A Parallel File System For Linux Clusters", Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, October 2000, pp. 317-327.
[5] S. Weil, S. Brandt, E. Miller, D. Long and C. Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System", Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI'06), November 2006.
[6] D. Nagle, D. Serenyi and A. Matthews, "The Panasas ActiveScale Storage Cluster - Delivering Scalable High Bandwidth Storage", Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC'04), Pittsburgh, PA, November 2004.
[7] F. Wang, S. Oral, G. Shipman, O. Drokin, T. Wang, and I. Huang, "Understanding Lustre Filesystem Internals", Technical Report ORNL/TM-2009/117, Oak Ridge National Laboratory, National Center for Computational Sciences, 2009.
[8] R. Sandberg, "The Sun Network Filesystem: Design, Implementation, and Experience", Proceedings of the Summer 1986 USENIX Technical Conference and Exhibition, 1986.
[9] M. Mesnier, G. Ganger, and E. Riedel, "Object-based storage", IEEE Communications Magazine, 41(8):84-90, August 2003.
[10] Z. Dimitrijevic and R. Rangaswami, "Quality of Service Support for Real-Time Storage Systems", Proceedings of the International IPSI-2003 Conference, October 2003.
[11] C. Lumb, A. Merchant, and G. Alvarez, "Façade: Virtual Storage Devices with Performance Guarantees", Proceedings of the 2nd USENIX Conference on File and Storage Technologies, 2003.
[12] W. Jin, J. Chase, and J. Kaur, "Interposed Proportional Sharing for a Storage Service Utility", Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2004.
[13] P. Goyal, H. M. Vin, and H. Cheng, "Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks", IEEE/ACM Transactions on Networking, vol. 5, no. 5, pp. 690-704, 1997.
[14] J. Zhang, A. Sivasubramaniam, A. Riska, Q. Wang, and E. Riedel, "An Interposed 2-Level I/O Scheduling Framework for Performance Virtualization", Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2005.
[15] W. Jin, J. S. Chase, and J. Kaur, "Interposed Proportional Sharing for a Storage Service Utility", in SIGMETRICS, E. G. C. Jr., Z. Liu, and A. Merchant, Eds. ACM, 2004, pp. 37-48.
[16] A. Gulati and P. Varman, "Lexicographic QoS Scheduling for Parallel I/O", Proceedings of the 17th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'05), Las Vegas, NV, June 2005.
[17] Y. Wang and A. Merchant, "Proportional-Share Scheduling for Distributed Storage Systems", Proceedings of the 5th USENIX Conference on File and Storage Technologies, San Jose, CA, pp. 47-60.
[18] J. Bucy, J. Schindler, S. Schlosser, G. Ganger and contributors, "The DiskSim Simulation Environment Version 4.0 Reference Manual", Technical Report CMU-PDL-08-101, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA.
[19] INET framework, URL: http://inet.omnetpp.org/
[20] A. Varga, "The OMNeT++ Discrete Event Simulation System", Proceedings of the European Simulation Multiconference (ESM'2001), Prague, Czech Republic, June 2001.
[21] D. Roselli, J. Lorch, and T. Anderson, "A Comparison of File System Workloads", Proceedings of the 2000 USENIX Annual Technical Conference, pages 41-54, San Diego, CA, June 2000.
[22] S. A. Weil, K. T. Pollack, S. A. Brandt, and E. L. Miller, "Dynamic Metadata Management for Petabyte-Scale File Systems", Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC'04), November 2004.
[23] S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn, "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data", Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC'06), Tampa, FL, November 2006.
[24] D. Chambliss, G. Alvarez, P. Pandey and D. Jadav, "Performance Virtualization for Large-Scale Storage Systems", Symposium on Reliable Distributed Systems, pages 109-118, IEEE, 2003.
[25] Y. Xu, L. Wang, D. Arteaga, M. Zhao, Y. Liu and R. Figueiredo, "Virtualization-based Bandwidth Management for Parallel Storage Systems", 5th Petascale Data Storage Workshop (PDSW'10), pages 1-5, New Orleans, LA, November 2010.
[26] A. Mathur, M. Cao, and S. Bhattacharya, "The New Ext4 Filesystem: Current Status and Future Plans", Proceedings of the 2007 Ottawa Linux Symposium, pages 21-34, June 2007.
[27] IOR: I/O Performance Benchmark, URL: https://asc.llnl.gov/sequoia/benchmarks/IOR_summary_v1.0.pdf