Data Processing
Data Processing is used for data processing through MinIO, databases, Web APIs, etc. The data types handled include:
- txt
- json
- doc
- html
- excel
- csv
- markdown
- ppt
Supported Text Processing methods
The data processing can support various methods including cleaning abnormal data, filtering, de-duplication, and anonymization.
Design
Local Development
Software Requirements
Before setting up the local data-process environment, please make sure the following software is installed:
- Python 3.10.x
Environment Setup
Install the Python dependencies in the requirements.txt file
Running
Run the server.py file in the data_manipulation directory
isort
isort is a tool for sorting imports alphabetically within your Python code. It helps maintain a consistent and clean import order.
install
pip install isort
isort a file
isort server.py
isort a directory
isort data_manipulation
config.yml
dev phase
The example config.yml is as the following:
minio:
access_key: '${MINIO_ACCESSKEY: hpU4SCmj5jixxx}'
secret_key: '${MINIO_SECRETKEY: xxx}'
api_url: '${MINIO_API_URL: 172.22.96.136.nip.io}'
secure: '${MINIO_SECURE: True}'
dataset_prefix: '${MINIO_DATASET_PREFIX: dataset}'
zhipuai:
api_key: '${ZHIPUAI_API_KEY: 871772ac03fcb9db9d4ce7b1e6eea27.VZZVy0mCox0WrzAG}'
llm:
use_type: '${LLM_USE_TYPE: zhipuai_online}' # zhipuai_online or open_ai
qa_retry_count: '${LLM_QA_RETRY_COUNT: 100}'
open_ai:
key: '${OPEN_AI_DEFAULT_KEY: fake}'
base_url: '${OPEN_AI_DEFAULT_BASE_URL: http://172.22.96.167.nip.io/v1/}'
model: '${OPEN_AI_DEFAULT_MODEL_NAME: cb219b5f-8f3e-49e1-8d5b-f0c6da481186}'
knowledge:
chunk_size: '${KNOWLEDGE_CHUNK_SIZE: 500}'
chunk_overlap: '${KNOWLEDGE_CHUNK_OVERLAP: 50}'
backendPg:
host: '${PG_HOST: localhost}'
port: '${PG_PORT: 5432}'
user: '${PG_USER: postgres}'
password: '${PG_PASSWORD: 123456}'
database: '${PG_DATABASE: arcadia}'
release phase
The example config.yml is as the following:
minio:
access_key: hpU4SCmj5jixxx
secret_key: xxx
api_url: 172.22.96.136.nip.io
secure: True
dataset_prefix: dataset
zhipuai:
api_key: 871772ac03fcb9db9d4ce7b1e6eea27.VZZVy0mCox0WrzAG
llm:
use_type: zhipuai_online # zhipuai_online or open_ai
qa_retry_count: 100
open_ai:
key: fake
base_url: http://172.22.96.167.nip.io/v1/
model: cb219b5f-8f3e-49e1-8d5b-f0c6da481186
knowledge:
chunk_size: 500
chunk_overlap: 50
backendPg:
host: localhost
port: 5432
user: admin
password: 123456
database: arcadia
In the K8s, you can use the config map to point to the /arcadia_app/data_manipulation/config.yml file.