Welcome to LLM-ABBA's documentation!
========================================


|pip| |pipd| |cython| |license| 

.. |pip| image:: https://img.shields.io/pypi/v/llmabba?color=lightsalmon
   :target: https://github.com/inEXASCALE/llm-abba

.. |pipd| image:: https://img.shields.io/pypi/dm/llmabba.svg?label=PyPI%20downloads
   :target: https://github.com/inEXASCALE/llm-abba

.. |cython| image:: https://img.shields.io/badge/Cython_Support-Accelerated-blue?style=flat&logoColor=cyan&labelColor=cyan&color=black
   :target: https://github.com/inEXASCALE/llm-abba


.. |license| image:: https://anaconda.org/conda-forge/classixclustering/badges/license.svg
   :target: https://github.com/inEXASCALE/llm-abba/blob/master/LICENSE


``llmabba`` is an software framework designed for performing time series application using Large Language Models (LLMs) based on symbolic representation, as introduced in the paper:
`LLM-ABBA: Symbolic Time Series Approximation using Large Language Models <https://arxiv.org/abs/2411.18506>`_.

Time series analysis often involves identifying patterns, trends, and structures within sequences of data points. Traditional methods, such as discrete wavelet transforms or symbolic aggregate approximation (SAX), have demonstrated success in converting continuous time series into symbolic representations, facilitating better analysis and compression. However, these methods are often limited in their ability to capture complex and subtle patterns.

``llmabba`` builds upon these techniques by incorporating the power of large language models, which have been shown to excel in pattern recognition and sequence prediction tasks. By applying LLMs to symbolic time series representation, ``llmabba`` is able to automatically discover rich, meaningful representations of time series data. This approach offers several advantages:

- **Higher accuracy and compression**: ``llmabba`` achieves better symbolic representations by leveraging LLMs' ability to understand and generate sequences, resulting in higher data compression and more accurate representation of underlying patterns.
- **Adaptability**: The use of LLMs enables the framework to adapt to various types of time series data, allowing for robust performance across different domains such as finance, healthcare, and environmental science.
- **Scalability**: ``llmabba``is designed to efficiently handle large-scale time series datasets, making it suitable for both small and big data applications.
- **Automatic feature discovery**: By harnessing the power of LLMs, ``llmabba`` can discover novel features and patterns in time series data that traditional symbolic approaches might miss.

In summary, ``llmabba`` represents a significant advancement in symbolic time series analysis, combining the power of modern machine learning techniques with established methods to offer enhanced compression, pattern recognition, and interpretability.

Key Features
-----------------
- **Symbolic Time Series Approximation**: Converts time series data into symbolic representations.
- **LLM-Powered Encoding**: Utilizes LLMs to enhance compression and pattern discovery.
- **Efficient and Scalable**: Designed to work with large-scale time series datasets.
- **Flexible Integration**: Compatible with various machine learning and statistical analysis workflows.

Installation
-----------------
To set up virtual environment, there are two ways to setup virtual enviroment for testing ``llmabba``, the first one is:

.. code-block:: bash

   mkdir ~/.myenv
   python -m venv ~/.myenv
   source ~/.myenv/bin/activate

The second one is via conda:

.. code-block:: bash

   conda create -n myenv
   conda activate myenv


Then, ``llmabba`` can be installed via pip:

.. code-block:: bash

    pip install llmabba



Usage
----------

For details of usage, please refer to the documentation and folder ``examples``.



``llmabba`` uses quantized ABBA with fixed-point adaptive piecewise linear continuous approximation (FAPCA). One would like to independently try quantized ABBA (with FAPCA), we provide independent interface:

.. code-block:: python

   from llmabba import ABBA
   
   ts = [[1.2, 1.4, 1.3, 1.8, 2.2, 2.4, 2.1], [1.2,  1.3, 1.2, 2.2, 1.4, 2.4, 2.1]]
   abba = ABBA(tol=0.1, alpha=0.1)
   symbolic_representation = abba.encode(ts)
   print("Symbolic Representation:", symbolic_representation)
   reconstruction = abba.decode(symbolic_representation)
   print("Reconstruction:", reconstruction)



For more details, please refer to the documentation in [examples](./examples).

If you are doing a time series classification task, 

.. code-block:: python
   
   import os
   import argparse
   import pandas as pd
   import numpy as np
   import llmabba.llmabba
   from llmabba.llmabba import LLMABBA
   from sklearn.model_selection import train_test_split
   
   ## Define the project name, task, model name, and prompt.  
   project_name = "PTBDB"
   task_tpye = "classification"  # classification, regression or forecasting
   model_name = 'mistralai/Mistral-7B-Instruct-v0.1'
   prompt_input = f"""This is a classification task. Identify the "ECG Abnormality" according to the given "Symbolic Series"."""
   
   ## Process the time series data and splite the datasets
   abnormal_df = pd.read_csv('../test_data/ptbdb_abnormal.csv', header=None)
   normal_df = pd.read_csv('../test_data/ptbdb_normal.csv', header=None)
   
   abnormal_length = abnormal_df.shape[0]
   normal_length = normal_df.shape[0]
   
   Y_data = np.concatenate((np.zeros([abnormal_length], dtype=int), np.ones([normal_length], dtype=int)), axis=0)
   X_data = pd.concat([abnormal_df, normal_df]).to_numpy()
   
   arranged_seq = np.random.randint(len(Y_data), size=len(Y_data))
   train_data_split = {'X_data':0, 'Y_data':0}
   
   train_data, test_data, train_target, test_target = train_test_split(X_data[arranged_seq, :], Y_data[arranged_seq], test_size=0.2)
   
   train_data_split['X_data'] = train_data[:500, :]
   train_data_split['Y_data'] = train_target[:500]
   
   ## Using LLM-ABBA package to train the data with QLoRA 
   LLMABBA_classification = LLMABBA()
   model_input, model_tokenizer = LLMABBA_classification.model(model_name=model_name, max_len=2048)
   
   tokenized_train_dataset, tokenized_val_dataset = LLMABBA_classification.process(
         project_name=project_name,
         data=train_data_split,
         task=task_tpye,
         prompt=prompt_input,
         alphabet_set=-1,
         model_tokenizer=model_tokenizer,
         scalar="z-score",
   )
   
   LLMABBA_classification.train(
         model_input=model_input,
         num_epochs=1,
         output_dir='../save/',
         train_dataset=tokenized_train_dataset,
         val_dataset=tokenized_val_dataset
   )
   
   
   ##If you finished the training, YOU CAN *Directly* do the inference with LLM-ABBA
   test_data = np.expand_dims(test_data[1, :], axis=0)
   peft_model_input, model_tokenizer = LLMABBA_classification.model(
      peft_file='../llm-abba-master/save/checkpoint-25/',
      model_name=model_name,
      max_len=2048)
   
   out_text = LLMABBA_classification.inference(
      project_name=project_name,
      data=test_data,
      task=task_tpye,
      prompt=prompt_input,
      ft_model=peft_model_input,
      model_tokenizer=model_tokenizer,
      scalar="z-score",
      llm_max_length=256,
      llm_repetition_penalty=1.9,
      llm_temperature=0.0,
      llm_max_new_tokens=2,
   )
   
   print(out_text)

Visualization
-----------------
Under developing...


Contributing
-----------------
We welcome contributions! If you'd like to improve LLM-ABBA, please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bugfix.
3. Submit a pull request.

License
------------
LLM-ABBA is released under the MIT License.

Contact
------------
For questions or feedback, please reach out via GitHub issues or contact the authors of the paper.



References
-----------
[1]Carson, E., Chen, X., and Kang, C., “: Understanding time series via symbolic approximation”, arXiv e-prints, arXiv:2411.18506, 2024. doi:10.48550/arXiv.2411.18506.





.. admonition:: Note
   
   The documentation is still on going. We welcome the contributions in any forms. 



.. toctree::
   :maxdepth: 2
   :caption: Contents:

   api.rst
   license.rst