XML Processor

This Python script processes XML files. It computes a perplexity score for each sentence and segment in the files using the XLM-RoBERTa model from Hugging Face's transformers library, then adds the scores to the XML files as new attributes.
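
As a rough illustration of that flow, the sketch below scores the text of each element and stores the result as an attribute. It is not the script's actual code: the `s` element name, the `ppl` attribute name, and the `perplexity_from_log_probs` and `annotate` helpers are all assumptions made for the example, and `score_fn` is a placeholder for whatever the XLM-RoBERTa based scorer returns.

```python
import math
import xml.etree.ElementTree as ET


def perplexity_from_log_probs(token_log_probs):
    """Perplexity as exp of the negative mean per-token log-probability (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def annotate(xml_text, score_fn, tag="s", attr="ppl"):
    """Return xml_text with `attr` set on every non-empty <tag> element.

    `tag` and `attr` are hypothetical names; `score_fn` maps a string to a
    float and stands in for the model-based scorer.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter(tag):
        text = "".join(elem.itertext()).strip()
        if text:  # skip elements that carry no text
            elem.set(attr, f"{score_fn(text):.4f}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Toy scorer: pretend every whitespace-separated token has probability 0.5.
    def toy_scorer(text):
        return perplexity_from_log_probs([math.log(0.5)] * len(text.split()))

    print(annotate("<doc><s>hello world</s><s/></doc>", toy_scorer))
```

Passing the scorer in as a function keeps the XML handling separate from the model, so the same loop could be reused for sentences and for segments.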

Installation

  1. Create a Python virtual environment.
    python3 -m venv .venv
    
  2. Activate the virtual environment.
    . .venv/bin/activate
    
  3. Install the required dependencies.
    pip install -r requirements.txt
    

Usage

python xml_processor.py --input_dir <input_dir> --output_dir <output_dir> [--cpu]
  • --input_dir: Directory containing the input XML files.
  • --output_dir: Directory where the processed files are saved.
  • --cpu: (Optional) Force the script to run on the CPU. Without this flag, the script uses the GPU if one is available.
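
The command line above could be parsed along these lines. This is only a sketch of the documented flags, not the script's own parser: `build_parser` and `pick_device` are hypothetical helpers, both directory flags are assumed to be required (as the usage line suggests), and the `cuda_available` value is assumed to come from something like `torch.cuda.is_available()`.

```python
import argparse


def build_parser():
    """Build a parser for the three documented flags."""
    parser = argparse.ArgumentParser(description="Add perplexity scores to XML files.")
    parser.add_argument("--input_dir", required=True,
                        help="Input directory containing XML files.")
    parser.add_argument("--output_dir", required=True,
                        help="Output directory to save processed files.")
    parser.add_argument("--cpu", action="store_true",
                        help="Force CPU use even if a GPU is available.")
    return parser


def pick_device(force_cpu, cuda_available):
    """Return "cpu" when forced or when no GPU is present, otherwise "cuda"."""
    return "cuda" if cuda_available and not force_cpu else "cpu"


# Example with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--input_dir", "in", "--output_dir", "out", "--cpu"])
# Assumption: the real script would pass torch.cuda.is_available() here.
device = pick_device(args.cpu, cuda_available=False)
```

Keeping the device choice in a small pure function makes the `--cpu` behaviour easy to check without a GPU or PyTorch installed.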