[[Category:Software]]
[https://arrow.apache.org/ Apache Arrow] is a cross-language development platform for in-memory data. It uses a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
== CUDA ==
Arrow is also available with CUDA.
{{Command|module load gcc arrow/X.Y.Z cuda}}
where X.Y.Z represents the desired version.
== Python bindings ==
The module contains bindings for multiple Python versions.
To find out which Python versions are compatible, run
{{Command|module spider arrow/X.Y.Z}}
where X.Y.Z represents the desired version.
Alternatively, search for ''pyarrow'' directly by running
{{Command|module spider pyarrow}}
=== PyArrow ===
The Arrow Python bindings (also named ''PyArrow'') have first-class integration with NumPy, Pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.
1. Load the required modules.
{{Command|module load gcc arrow/X.Y.Z python/3.11}}
where X.Y.Z represents the desired version.
2. Import PyArrow.
{{Command|python -c "import pyarrow"}}
If the command displays nothing, the import was successful.
For more information, see the [https://arrow.apache.org/docs/python/ Arrow Python] documentation.
==== Fulfilling other Python package dependencies ====
Other Python packages may depend on PyArrow in order to be installed. With the ''arrow'' module loaded, your package's dependency on ''pyarrow'' will be satisfied.
{{Command
|pip list {{!}} grep pyarrow
|result=
pyarrow 17.0.0
}}
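The same check can be done from within Python. The sketch below uses the standard library's <code>importlib.metadata</code> to look up the installed version of a package; the helper name <code>installed_version</code> is hypothetical and only serves the example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not found.

    Hypothetical helper for illustration; it only wraps importlib.metadata.version.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# With the arrow module loaded, this should print a version rather than None.
print("pyarrow:", installed_version("pyarrow"))
```

If this prints <code>None</code>, the module providing PyArrow is probably not loaded in the current session.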
==== Apache Parquet format ====
The [http://parquet.apache.org/ Parquet] file format is available.
To import the Parquet module, follow the steps above for PyArrow, then run
{{Command|python -c "import pyarrow.parquet"}}
If the command displays nothing, the import was successful.
== R bindings ==
The Arrow package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for analyzing large, multi-file datasets ([https://arrow.apache.org/docs/r/reference/open_dataset.html open_dataset()]), working with individual Parquet files ([https://arrow.apache.org/docs/r/reference/read_parquet.html read_parquet()], [https://arrow.apache.org/docs/r/reference/write_parquet.html write_parquet()]) and Feather files ([https://arrow.apache.org/docs/r/reference/read_feather.html read_feather()], [https://arrow.apache.org/docs/r/reference/write_feather.html write_feather()]), as well as lower-level access to the Arrow memory and messages.
=== Installation ===
1. Load the required modules.
{{Command|module load StdEnv/2020 gcc/9.3.0 arrow/8 r/4.1 boost/1.72.0}}
2. Specify the local installation directory.
{{Commands
|mkdir -p ~/.local/R/$EBVERSIONR/
|export R_LIBS{{=}}~/.local/R/$EBVERSIONR/
}}
3. Export the required variables to ensure you are using the system installation.
{{Commands
|export PKG_CONFIG_PATH{{=}}$EBROOTARROW/lib/pkgconfig
|export INCLUDE_DIR{{=}}$EBROOTARROW/include
|export LIB_DIR{{=}}$EBROOTARROW/lib
}}
4. Install the bindings.
{{Command|R -e 'install.packages("arrow", repos{{=}}"https://cloud.r-project.org/")'}}
=== Usage ===
After the bindings are installed, they have to be loaded.
1. Load the required modules.
{{Command|module load StdEnv/2020 gcc/9.3.0 arrow/8 r/4.1}}
2. Load the library.
{{Command
|R -e "library(arrow)"
|result=
> library("arrow")
Attaching package: ‘arrow’
}}
For more information, see the [https://arrow.apache.org/docs/r/index.html Arrow R documentation].