Big Data Evaluator Homepage

Other Projects

Flame-MR

Flame-MR is a MapReduce framework that transparently improves the performance of Hadoop applications. It employs several kinds of optimizations, such as avoiding memory copies, efficient sort and merge algorithms, and flexible use of resources. Moreover, its event-driven architecture overlaps data transfer with processing. Flame-MR also keeps binary compatibility with Hadoop, so applications do not have to be modified or recompiled to be executed. Experimental results show that Flame-MR can reduce the execution time of iterative workloads by half.
BDWatchdog

BDWatchdog is a novel framework that allows real-time and scalable analysis of Big Data applications. It combines two approaches to get an accurate picture of what an application is doing with the resources it has available (e.g., CPU, memory, disk and network): per-process resource monitoring using time series, and mixed system and JVM low-level profiling using flame graphs.
SMusket

SparkMusket (SMusket) is a parallel read error corrector built upon the open-source Apache Spark Big Data framework that supports single-end and paired-end reads from FASTQ/FASTA datasets. This tool implements an accurate error correction algorithm based on Musket, which relies on the k-spectrum-based approach and provides three correction techniques in a multistage workflow.
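
The k-spectrum idea can be illustrated with a small, sequential sketch: count every k-mer in the reads and treat the rare ones as likely sequencing errors. This is not SMusket's or Musket's actual code; the class name `KmerSpectrum`, the method names and the `minCount` threshold are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a k-mer spectrum; not part of SMusket or Musket.
public class KmerSpectrum {

    // Count every substring of length k (k-mer) across all reads.
    public static Map<String, Integer> count(List<String> reads, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String read : reads) {
            for (int i = 0; i + k <= read.length(); i++) {
                counts.merge(read.substring(i, i + k), 1, Integer::sum);
            }
        }
        return counts;
    }

    // A k-mer is considered "trusted" when it occurs at least minCount times.
    public static boolean isTrusted(Map<String, Integer> counts, String kmer, int minCount) {
        return counts.getOrDefault(kmer, 0) >= minCount;
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList("ACGTAC", "ACGTAC", "ACGTTC");
        Map<String, Integer> counts = count(reads, 4);
        System.out.println("ACGT trusted: " + isTrusted(counts, "ACGT", 2)); // true (seen 3 times)
        System.out.println("GTTC trusted: " + isTrusted(counts, "GTTC", 2)); // false (seen once)
    }
}
```

In a real corrector, bases covered only by untrusted k-mers would be candidates for correction; that step is omitted here.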
BigDEC

BigDEC is a parallel error corrector intended for short DNA reads that is built upon two popular open-source Big Data frameworks: Apache Spark and Apache Flink. This tool integrates three different correction algorithms based on the k-mer spectrum method: Musket, BLESS 2 and RECKONER.
HSRA

HSRA is a MapReduce-based parallel tool for mapping reads from RNA-seq experiments that supports single-end and paired-end read alignments from FASTQ/FASTA datasets. RNA-seq analysis begins by mapping reads to a reference genome in order to determine the location from which the reads originated, which is a very time-consuming step in bioinformatics pipelines. This tool allows bioinformatics researchers to efficiently distribute their mapping tasks over the nodes of a computer cluster by combining a fast spliced aligner (HISAT2) with a Big Data framework (Apache Hadoop).
MarDRe

MarDRe is a de novo MapReduce-based parallel tool to remove duplicate and near-duplicate DNA reads in large-scale FASTQ/FASTA datasets. Duplicate reads can be seen as identical or nearly identical sequences with some mismatches, so removing them decreases the memory requirements and computational time of downstream analysis without damaging biological information. MarDRe is written in Java and built upon Apache Hadoop.
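
The notion of "nearly identical sequences with some mismatches" can be sketched as a mismatch count between two reads of equal length. The code below is a minimal sequential illustration under that assumption, not MarDRe's actual algorithm; the class `NearDuplicateFilter`, its methods and the `maxMismatches` parameter are hypothetical, and the pairwise comparison is quadratic, so it would not scale to real datasets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of near-duplicate removal; not MarDRe's implementation.
public class NearDuplicateFilter {

    // True when both reads have the same length and differ in at most maxMismatches positions.
    public static boolean isNearDuplicate(String a, String b, int maxMismatches) {
        if (a.length() != b.length()) {
            return false;
        }
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i) && ++mismatches > maxMismatches) {
                return false;
            }
        }
        return true;
    }

    // Keep a read only if it is not a near-duplicate of a read that was kept earlier.
    public static List<String> removeDuplicates(List<String> reads, int maxMismatches) {
        List<String> kept = new ArrayList<>();
        for (String read : reads) {
            boolean duplicate = false;
            for (String earlier : kept) {
                if (isNearDuplicate(read, earlier, maxMismatches)) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(read);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList("ACGTACGT", "ACGTACGA", "TTTTCCCC", "ACGTACGT");
        System.out.println(removeDuplicates(reads, 1)); // [ACGTACGT, TTTTCCCC]
    }
}
```

With `maxMismatches` set to 0, the same code removes only exact duplicates.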
HSP

Hadoop Sequence Parser (HSP) is a Java library for parsing DNA/RNA sequence reads from FASTQ/FASTA datasets stored in the Hadoop Distributed File System (HDFS). HSP supports the processing of input datasets compressed with the Gzip and BZip2 codecs.
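
To show what parsing a FASTQ dataset involves, here is a minimal sketch in plain Java. It does not use the HSP API or HDFS; the class `FastqSketch` and its `Read` type are hypothetical, and the sketch assumes the common four-line record layout (header starting with `@`, bases, separator starting with `+`, quality string) with the input lines already loaded in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of FASTQ parsing; this is not the HSP API.
public class FastqSketch {

    // One sequence read: name (header without '@'), bases and per-base quality string.
    public static final class Read {
        public final String name;
        public final String bases;
        public final String quality;

        public Read(String name, String bases, String quality) {
            this.name = name;
            this.bases = bases;
            this.quality = quality;
        }
    }

    // Parse four-line FASTQ records; multi-line sequences are not supported in this sketch.
    public static List<Read> parse(List<String> lines) {
        if (lines.size() % 4 != 0) {
            throw new IllegalArgumentException("Expected four lines per FASTQ record");
        }
        List<Read> reads = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += 4) {
            String header = lines.get(i);
            String bases = lines.get(i + 1);
            String separator = lines.get(i + 2);
            String quality = lines.get(i + 3);
            if (!header.startsWith("@") || !separator.startsWith("+")
                    || bases.length() != quality.length()) {
                throw new IllegalArgumentException("Malformed FASTQ record at line " + (i + 1));
            }
            reads.add(new Read(header.substring(1), bases, quality));
        }
        return reads;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("@read1", "ACGT", "+", "IIII");
        Read first = parse(lines).get(0);
        System.out.println(first.name + " " + first.bases); // read1 ACGT
    }
}
```

A distributed parser additionally has to find record boundaries inside arbitrary file splits, which this sketch does not attempt.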
SeQual

SeQual is a Big Data tool implemented upon Apache Spark to perform quality control operations (e.g. filtering, trimming) on genomic datasets in a scalable way, currently supporting single-end and paired-end reads in FASTQ/FASTA formats.
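
As an illustration of the kind of operations mentioned (trimming and filtering), the following sequential sketch trims low-quality bases from the 3' end of a read and applies a minimum-length filter. It is not SeQual's implementation or API; the class `QualityControlSketch`, its methods and the thresholds are hypothetical, and quality characters are assumed to use the Phred+33 encoding.

```java
// Hypothetical illustration of read trimming and filtering; not SeQual's API.
public class QualityControlSketch {

    // Assumption: Phred+33 encoding, so the score of a quality character is (char - 33).
    private static final int PHRED_OFFSET = 33;

    // Number of leading bases to keep after dropping trailing bases below minQuality.
    public static int trimmedLength(String quality, int minQuality) {
        int end = quality.length();
        while (end > 0 && quality.charAt(end - 1) - PHRED_OFFSET < minQuality) {
            end--;
        }
        return end;
    }

    // Filter: keep only reads that still have at least minLength bases.
    public static boolean passesLengthFilter(String bases, int minLength) {
        return bases.length() >= minLength;
    }

    public static void main(String[] args) {
        String bases = "ACGTACGT";
        String quality = "IIIII##!"; // scores 40,40,40,40,40,2,2,0
        String trimmed = bases.substring(0, trimmedLength(quality, 20));
        System.out.println(trimmed);                        // ACGTA
        System.out.println(passesLengthFilter(trimmed, 4)); // true
    }
}
```

Because each read is processed independently, operations of this shape map naturally onto a per-record transformation in a framework such as Apache Spark.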
jhwloc

jhwloc is a Java-based wrapper library for the Portable Hardware Locality (hwloc) project that provides JVM-based applications with a CPU and memory binding API to manage hardware affinities and a reliable way to gather information about the underlying hardware (number of cores, hardware threads, NUMA nodes, etc.).