simonshen

Data Engineer/Scientist

Twitter

Responsible for building the end-to-end automated data distillation pipeline, using Spark to distill petabyte-scale text data into terabyte-scale, high-quality general-purpose Chinese and English corpora (of much higher quality than open-source datasets).
Responsible for building the end-to-end text deduplication pipeline for very large datasets, using the MinHashLSH algorithm with PySpark to perform fuzzy deduplication and custom priority-based deduplication, reducing the impact of duplicate data on large-model performance. Also optimized Spark job submission parameters and the deduplication process itself: for TB-scale text data, deduplication takes only hours and requires memory of at most half the data size, shortening the data delivery cycle.
Responsible for building an image deduplication pipeline for very large datasets, using the pHash algorithm to deduplicate hundreds of millions of images.
Responsible for building an automated daily cleaning pipeline for web page data, using Spark Streaming to distill web page data, delivering data through an autonomous end-to-end process and reducing dependence on open-source datasets.
Responsible for training and applying FastText-based classification models to filter low-quality text out of massive text datasets, addressing the incomplete filtering of junk text such as pornography, violence, and horror.
Responsible for building the end-to-end automated data quality assessment pipeline, continually constructing and tuning GPT-4 prompts and using GPT-4's language capabilities to assess data quality quickly, consistently, and objectively, reducing the impact of manual evaluation on data quality and data delivery.
Proposed and implemented multiple data supplementation strategies to address the shortage of high-quality Chinese text data.
Countered sites' anti-crawling mechanisms at the data preprocessing stage, removing the random dirty text those mechanisms introduce and improving the quality of the text corpus.
Responsible for training and applying FastText-based classification models to label text data, evaluate data diversity, and build up a reserve of industry-specific data, addressing the shortage of industry data.
Built automated pipelines to split tens of billions of videos by scene, delivering millions of video clips per day on average.
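
As an illustration of the fuzzy and priority-based text deduplication described above, the following is a minimal single-machine sketch of MinHash signatures with LSH banding. It is not the production PySpark pipeline: the parameter values (`NUM_PERM`, `BANDS`, the 0.7 threshold), the character 5-gram shingling, and the reading of "custom priority" as "keep the document with the higher priority score" are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # signature length (assumed value)
BANDS = 16     # LSH bands; rows per band = NUM_PERM // BANDS (assumed value)


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_PERM))


def candidate_pairs(signatures):
    """LSH banding: documents sharing an identical band become candidate pairs."""
    rows = NUM_PERM // BANDS
    pairs = set()
    for b in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[sig[b * rows:(b + 1) * rows]].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, priority, threshold=0.7):
    """Drop near-duplicates, keeping the higher-priority document of each pair.

    docs: mapping of id -> text; priority: mapping of id -> number (hypothetical
    scoring; higher wins, and on a tie the first id of the sorted pair is kept).
    """
    sigs = {doc_id: minhash(text) for doc_id, text in docs.items()}
    dropped = set()
    for a, b in sorted(candidate_pairs(sigs)):
        if a in dropped or b in dropped:
            continue
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold:
            dropped.add(a if priority[a] < priority[b] else b)
    return [doc_id for doc_id in docs if doc_id not in dropped]
```

In a distributed setting the same idea is available through `pyspark.ml.feature.MinHashLSH`, whose `approxSimilarityJoin` produces the candidate pairs; the priority rule would then be applied to the joined pairs.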

Experience: 2 years

Yearly salary: $44,000

Hourly rate: $20

Nationality: 🇨🇳 China

Residency: 🇨🇳 China

Skills

big-data

data-science

golang

python

english

chinese-mandarin
